Bug #9016
Closed
String#encoding is lying?
Description
Please see attached test case.
If you try to open a file using a CP850 (possibly other encodings as well) path that was passed as a command-line argument, it fails unless you first transcode the argument into its own reported encoding (CP850) from some other encoding (in my case, both ISO-8859-1 and Windows-1252 worked). It is just as if ARGV[0].encoding were lying!
Before, in Ruby 1.8, File.open would work just fine. I have a script that just stopped working, till I found the above workaround. This seems to me like a bug. I would expect Ruby to just do its best in order to convert user input into the required encodings for file APIs and such. Meaning I would not like for a possible fix to require any code migration from 1.8 to 1.9+ at all.
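The workaround described above can be sketched roughly as follows. This is only an illustration: the helper name `relabel_and_transcode` and its `actual:` parameter are hypothetical, the mislabeled argument is simulated with `force_encoding` rather than read from a real command line, and ISO-8859-1 as the "real" encoding is an assumption taken from this report.

```ruby
# Hypothetical helper (not part of Ruby): treat the bytes of `arg` as being in
# `actual`, then transcode them into the encoding the string claims to have.
def relabel_and_transcode(arg, actual: "ISO-8859-1")
  reported = arg.encoding          # e.g. CP850, as reported for ARGV[0]
  arg.encode(reported, actual)     # String#encode(destination, source)
end

# Simulated ARGV[0]: the ISO-8859-1 byte for a-tilde, labeled as CP850.
arg = "\xE3".b.force_encoding("CP850")

fixed = relabel_and_transcode(arg)
puts fixed.encoding   # CP850
puts fixed.dump       # "\xC6"
```

In a real script the result of the helper, not ARGV[0] itself, would be what gets passed to File.open.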
Updated by nobu (Nobuyoshi Nakada) about 11 years ago
- Status changed from Open to Feedback
- Assignee set to windows
I know nothing about CP850; please give a concrete example path name that reproduces it.
Updated by renatosilva (Renato Silva) about 11 years ago
If you type "chcp 850" in cmd.exe before calling the script, it should accept the argument. You can use the word "Japonês" (Japanese) as an example for the file path.
Updated by renatosilva (Renato Silva) about 11 years ago
This reduced test case shows that the argument looks like an ISO-8859-1 string even though its encoding is reported as CP850.
Updated by nobu (Nobuyoshi Nakada) about 11 years ago
It would vary depending on the system code page.
What do you expect and what did you get?
Updated by renatosilva (Renato Silva) about 11 years ago
I would expect that if ARGV[0].encoding is CP850, then the string is encoded as CP850. Instead, the string is encoded in another encoding, ISO-8859-1. The reduced test case outputs this:
Encoding of argument is reported as CP850 and as valid.
Let us inspect the a-tilde argument: "\xE3"
Let us inspect the a-tilde from UTF-8 source code transcoded into CP850: "\xC6"
Let us inspect the a-tilde from UTF-8 source code transcoded into ISO-8859-1: "\xE3"
RESULT: as you can see, the argument looks like an ISO-8859-1 string, but reports its encoding as CP850.
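One way to see why the "valid" report does not help here is to compare the argument's raw bytes against the expected character transcoded into several candidate encodings. The sketch below is only an illustration: `matching_encodings` is a hypothetical helper, and the argument is simulated with `force_encoding` instead of coming from a real command line.

```ruby
# Hypothetical helper: list the candidate encodings in which `expected`
# has exactly the same bytes as `arg`.
def matching_encodings(arg, expected, candidates)
  candidates.select { |name| expected.encode(name).bytes == arg.bytes }
end

# Simulated ARGV[0]: byte 0xE3 labeled as CP850, as in the report.
arg = "\xE3".b.force_encoding("CP850")

# 0xE3 is also a valid byte in the single-byte CP850, so the string
# passes valid_encoding? even though the label is wrong.
puts arg.valid_encoding?   # true

# "\u00E3" is a-tilde written as an escape, so the source encoding does not matter.
p matching_encodings(arg, "\u00E3", %w[CP850 ISO-8859-1 Windows-1252])
# ["ISO-8859-1", "Windows-1252"]
```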
Updated by jeremyevans0 (Jeremy Evans) almost 4 years ago
- Status changed from Feedback to Closed
- Backport deleted (1.9.3: UNKNOWN, 2.0.0: UNKNOWN)
As Ruby 3.0 uses UTF-8 for ARGV, this is fixed.
With modified example:
puts "Encoding of argument is reported as #{ARGV[0].encoding} and as #{ARGV[0].valid_encoding? ? "valid" : "invalid"}."
puts "Let us inspect the a-tilde argument: #{ARGV[0].dump}"
puts "Let us inspect the a-tilde from UTF-8 source code transcoded into CP850: #{"ã".encode("CP850").dump}"
puts "Let us inspect the a-tilde from UTF-8 source code transcoded into ISO-8859-1: #{"ã".encode("ISO-8859-1").dump}"
output is:
ruby t.rb ã
Encoding of argument is reported as UTF-8 and as valid.
Let us inspect the a-tilde argument: "\u00E3"
Let us inspect the a-tilde from UTF-8 source code transcoded into CP850: "\xC6"
Let us inspect the a-tilde from UTF-8 source code transcoded into ISO-8859-1: "\xE3"
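With a UTF-8 argument, the original use case (passing the argument straight to File.open) can be checked with a small round trip. The sketch below is an assumption-laden illustration: `roundtrip_utf8_path` is a hypothetical helper, the file name is taken from the example word given earlier in this thread, and a plain UTF-8 string stands in for ARGV[0] instead of a real command-line argument.

```ruby
require "tmpdir"

# Hypothetical helper: create a file with a non-ASCII name in a temporary
# directory, then open it through a UTF-8 path string, with no re-encoding.
def roundtrip_utf8_path
  Dir.mktmpdir do |dir|
    path = File.join(dir, "Japon\u00EAs.txt")   # "Japonês.txt"
    File.write(path, "ok")
    arg = path.dup                               # stand-in for ARGV[0]
    [arg.encoding.name, File.open(arg, &:read)]
  end
end

encoding_name, content = roundtrip_utf8_path
puts "Encoding: #{encoding_name}, content: #{content}"
```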