Bug #9016

closed

String#encoding is lying?

Added by renatosilva (Renato Silva) over 10 years ago. Updated about 3 years ago.

Status:
Closed
Assignee:
Target version:
-
ruby -v:
ruby 2.0.0p247 (2013-06-27) [i386-mingw32]
Backport:
[ruby-core:57830]

Description

Please see attached test case.

If you try to open a file whose path contains CP850 characters (other code pages may be affected too) and that path was passed as a command-line argument, the open fails unless you transcode the argument into its own reported encoding (CP850) from some other encoding (in my case, both ISO-8859-1 and Windows-1252 worked). It is as if ARGV[0].encoding is lying!

In Ruby 1.8, File.open worked just fine. I have a script that simply stopped working until I found the above workaround. This seems to me like a bug. I would expect Ruby to do its best to convert user input into the encodings required by file APIs and the like. In other words, I would not want a fix to require any code migration from 1.8 to 1.9+.
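
The workaround described above can be sketched roughly as follows. This is only an illustration, not the attached test case: `relabel_argument` is a hypothetical helper name, the default of ISO-8859-1 as the "real" encoding is an assumption taken from what worked for the reporter, and the argument is simulated with a hand-built string rather than read from ARGV.

```ruby
# Hypothetical helper: reinterpret the bytes of a mislabeled argument.
# `actual` is the encoding the bytes are assumed to really be in
# (ISO-8859-1 and Windows-1252 both worked for the reporter);
# `reported` is whatever String#encoding claims.
def relabel_argument(arg, actual: "ISO-8859-1", reported: arg.encoding)
  # String#encode(dst, src) reads the bytes as `src` and transcodes to `dst`.
  arg.encode(reported, actual)
end

# Simulate the argument from the report: byte 0xE3 tagged as CP850.
simulated = "\xE3".b.force_encoding("CP850")

fixed = relabel_argument(simulated)
puts fixed.dump                     # the byte is now the CP850 a-tilde
puts fixed.encode("UTF-8") == "\u00E3"
```

In a real script the call would presumably be `relabel_argument(ARGV[0])` before passing the path to File.open, and only on affected Ruby versions.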


Files

encoding-lying.rb (1.04 KB), renatosilva (Renato Silva), 10/12/2013 04:02 PM
encoding-lying-reduced.rb (761 Bytes), renatosilva (Renato Silva), 10/14/2013 06:57 PM

Updated by nobu (Nobuyoshi Nakada) over 10 years ago

  • Status changed from Open to Feedback
  • Assignee set to windows

I know nothing about CP850. Please give a concrete example path name to reproduce it.

Updated by renatosilva (Renato Silva) over 10 years ago

If you type "chcp 850" in cmd.exe before calling the script, it should accept the argument. You can use the word "Japonês" (Japanese) as an example file path.

Updated by renatosilva (Renato Silva) over 10 years ago

This reduced test case shows that the argument looks like an ISO-8859-1 string even though its encoding is reported as CP850.

Updated by nobu (Nobuyoshi Nakada) over 10 years ago

It would vary depending on the system code page.

What do you expect and what did you get?

Updated by renatosilva (Renato Silva) over 10 years ago

I would expect that if ARGV[0].encoding is CP850, then the string is encoded as CP850. Instead, the string is encoded in another encoding, ISO-8859-1. The reduced test case should output this:

Encoding of argument is reported as CP850 and as valid.
Let us inspect the a-tilde argument: "\xE3"
Let us inspect the a-tilde from UTF-8 source code transcoded into CP850: "\xC6"
Let us inspect the a-tilde from UTF-8 source code transcoded into ISO-8859-1: "\xE3"
RESULT: as you can see, the argument looks like an ISO-8859-1 string, but reports its encoding as CP850.
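
The comparison made in the reduced test case can be generalized into a small diagnostic. The following is a sketch only: `matching_encodings` and the `CANDIDATES` list are hypothetical names chosen for illustration, and the argument is again simulated instead of taken from ARGV.

```ruby
# Candidate encodings to test against (an illustrative, assumed list).
CANDIDATES = %w[CP850 ISO-8859-1 Windows-1252 UTF-8].freeze

# Hypothetical helper: given a string and the text it is expected to
# contain, return the candidate encodings whose byte sequence for that
# text matches the string's actual bytes.
def matching_encodings(str, expected_text, candidates = CANDIDATES)
  candidates.select do |name|
    begin
      expected_text.encode(name).bytes == str.bytes
    rescue EncodingError
      false # expected_text cannot be represented in this encoding
    end
  end
end

# Simulated argument: byte 0xE3 labeled as CP850, as in the report.
arg = "\xE3".b.force_encoding("CP850")
p matching_encodings(arg, "\u00E3")
```

If the reported encoding is missing from the returned list, the label and the bytes disagree, which is the situation described in this report.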

Updated by jeremyevans0 (Jeremy Evans) about 3 years ago

  • Status changed from Feedback to Closed
  • Backport deleted (1.9.3: UNKNOWN, 2.0.0: UNKNOWN)

As Ruby 3.0 uses UTF-8 for ARGV, this is fixed.

With modified example:

puts "Encoding of argument is reported as #{ARGV[0].encoding} and as #{ARGV[0].valid_encoding? ? "valid" : "invalid"}."
puts "Let us inspect the a-tilde argument: #{ARGV[0].dump}"
puts "Let us inspect the a-tilde from UTF-8 source code transcoded into CP850: #{"ã".encode("CP850").dump}"
puts "Let us inspect the a-tilde from UTF-8 source code transcoded into ISO-8859-1: #{"ã".encode("ISO-8859-1").dump}"

output is:

ruby t.rb ã
Encoding of argument is reported as UTF-8 and as valid.
Let us inspect the a-tilde argument: "\u00E3"
Let us inspect the a-tilde from UTF-8 source code transcoded into CP850: "\xC6"
Let us inspect the a-tilde from UTF-8 source code transcoded into ISO-8859-1: "\xE3"