Bug #15497
Closed
Encoding of error messages should not depend on the locale encoding
Description
This seems to happen mostly for internal errors, as raise in Ruby code of course just uses the passed String's encoding for the message.
Example:
name = "été"
p name.encoding
begin
  Module.new.const_set(name, 1)
rescue => e
  p e
  p e.message.encoding
end
When run, it gives:
$ LANG=en_US.UTF-8 ruby c.rb
#<Encoding:UTF-8>
#<NameError: wrong constant name été>
#<Encoding:UTF-8>
$ LANG=C ruby c.rb
#<Encoding:UTF-8>
#<NameError: wrong constant name "\u00E9t\u00E9">
#<Encoding:US-ASCII>
Depending on the locale encoding, the encoding of the message changes!
This seems very unexpected, is inconvenient for testing (e.g., https://github.com/ruby/spec/commit/a6101a6e and any test checking exception messages with non-US-ASCII characters),
and does not represent what is in the source code (here it's clearly a valid UTF-8 String).
I think for such a case, the encoding of the constant name should be used, i.e., UTF-8.
Another way to see it is that the message should be built like "wrong constant name ".force_encoding('us-ascii') + constant_name.
Indeed, if we do build the message manually like that it works as expected:
name = "été"
p name.encoding
begin
  raise "wrong constant name ".force_encoding('US-ASCII') + name
rescue => e
  p e
  p e.message.encoding
end
gives
$ LANG=en_US.UTF-8 ruby c.rb
#<Encoding:UTF-8>
#<RuntimeError: wrong constant name été>
#<Encoding:UTF-8>
$ LANG=C ruby c.rb
#<Encoding:UTF-8>
#<RuntimeError: wrong constant name \u00E9t\u00E9>
#<Encoding:UTF-8>
Note that the message still looks different, but that's the effect of Kernel#p, because it does not know how to display UTF-8 characters in a US-ASCII terminal.
Nevertheless, both messages have the same bytes and encoding, which fixes all 3 problems mentioned above.
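The concatenation behavior relied on here can be checked in isolation. The following is a minimal sketch, not the interpreter's internal code path: it assumes the source file is read as UTF-8 (Ruby's default), and the variable names `prefix` and `message` are only illustrative.

```ruby
# Sketch: String#+ picks the result encoding from the operands, not from the locale.
name = "été"  # UTF-8, because Ruby source files default to UTF-8

# An ASCII-only prefix tagged as US-ASCII, as in the manual example above.
# String.new with encoding: tags the bytes without transcoding them.
prefix = String.new("wrong constant name ", encoding: Encoding::US_ASCII)

message = prefix + name

p Encoding.compatible?(prefix, name)  # encoding the concatenation will use
p message.encoding                    # encoding of the resulting message
p message.bytes == "wrong constant name été".bytes
```

Because the prefix contains only ASCII bytes, it is compatible with the UTF-8 name, so the result takes the name's encoding whatever LANG is set to.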
Setting Encoding.default_internal can work around this, but it's a bad workaround: it cannot work reliably in a multithreaded Ruby application, it affects many more things than just error messages, and the default behavior should be error messages with a deterministic encoding, just like raise in Ruby code.
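One likely reason the workaround is fragile with threads is that Encoding.default_internal is a single process-wide setting. The sketch below only illustrates that point; the variable names `original` and `seen_in_main` are made up for the example, and Ruby may print a warning about setting the value when run in verbose mode.

```ruby
# Sketch: Encoding.default_internal is one process-wide setting, not per-thread state.
original = Encoding.default_internal  # usually nil unless configured

# A worker thread changes the setting...
Thread.new { Encoding.default_internal = Encoding::UTF_8 }.join

# ...and the main thread observes the change too.
seen_in_main = Encoding.default_internal
p seen_in_main

# Restore the previous value so the rest of the program is unaffected.
Encoding.default_internal = original
```

Any thread that toggles the setting around a single operation therefore changes it for every other thread running at the same time.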