Feature #1106

closed

Script encoding vs. default_internal: Implicitly transcode strings/regexps

Added by tomel (Tom Link) about 15 years ago. Updated almost 13 years ago.

Status: Rejected
Assignee: -
Target version: -
[ruby-core:21830]

Description

=begin
If I'm not mistaken, a related issue was discussed in the past (e.g. [1]). Anyway, please take a moment to consider the following scripts and input file:

FILE: test2.rb:

# encoding: UTF-8

Encoding.default_internal = Encoding::UTF_8
Encoding.default_external = Encoding::UTF_8
require 'test2a'
File.readlines('test2.txt').each do |line|
  p line, test2a(line)
end

FILE: test2a.rb

# encoding: ISO-8859-1

p __ENCODING__
def test2a(x)
  x =~ /[äöüÄÖÜß]/
end

FILE: test2.txt (UTF-8 byte sequences; the second line should read "weiß", the third one "Bär" in UTF-8 encoding)
foo
weiß
Bär
bar

If I run

$ ruby -v
ruby 1.9.1p0 (2009-01-30 revision 21907) [i386-cygwin]
$ ruby test2.rb
#<Encoding:ISO-8859-1>
"foo\n"
nil
/home/t/src/tmp/test2a.rb:6:in `test2a': invalid byte sequence in UTF-8 (ArgumentError)
        from test2.rb:9:in `block in <main>'
        from test2.rb:8:in `each'
        from test2.rb:8:in `<main>'

It seems that the ISO-8859-1 encoded regexp in test2a.rb, /[äöüÄÖÜß]/, is not transcoded to UTF-8. But since default_internal is set to UTF-8, Ruby seems to expect a valid UTF-8 string. Please forgive me if my interpretation of that error message is wrong. It is quite possible that I missed something and that an easy solution to this problem already exists that I don't know of. If that is the case, I kindly ask you to tell me about it.
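
A minimal single-file sketch of the same mismatch, assuming a recent Ruby and a snippet saved as UTF-8. The ISO-8859-1 regexp is built at run time with String#encode as a stand-in for the literal in test2a.rb, and the variable names are only illustrative. On such a Ruby the mismatch shows up as Encoding::CompatibilityError, so the exception class may differ from the ArgumentError in the 1.9.1 output above.

```ruby
# Stand-in for the ISO-8859-1 regexp literal in test2a.rb:
# Regexp.new keeps the encoding of a non-ASCII pattern string.
latin1_re = Regexp.new("[äöüÄÖÜß]".encode(Encoding::ISO_8859_1))

utf8_ascii  = "foo\n"   # UTF-8 string containing only 7-bit characters
utf8_umlaut = "weiß\n"  # UTF-8 string containing a non-ASCII character

p latin1_re.encoding

# A 7-bit string is compatible with the ISO-8859-1 regexp, so this just
# reports "no match".
p latin1_re =~ utf8_ascii

# A non-ASCII UTF-8 string is not compatible; nothing is transcoded implicitly.
begin
  latin1_re =~ utf8_umlaut
rescue Encoding::CompatibilityError => e
  puts "#{e.class}: #{e.message}"
end
```

This would explain why the first input line ("foo") passes while the first line with an umlaut fails.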

If this is how Ruby 1.9.1 is currently supposed to work, I would humbly suggest silently transcoding all strings found in scripts to default_internal if it is non-nil. IMHO, not transcoding strings doesn't make any sense and drives users who work with heterogeneous files to madness. If a string cannot be transcoded to default_internal, an error should be raised. Thanks.
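
Until something like that exists, the transcoding can be done by hand. The following is only a sketch of the idea: transcode_regexp is a hypothetical helper, not part of Ruby, and it assumes the snippet is saved as UTF-8. It rebuilds a regexp from its source in the target encoding, and String#encode raises if the pattern cannot be represented there.

```ruby
# Hypothetical helper (not a Ruby API): rebuild a regexp in another encoding.
# Defaults to Encoding.default_internal, falling back to UTF-8 if it is nil.
def transcode_regexp(re, target = Encoding.default_internal || Encoding::UTF_8)
  # Keep only the user-visible flags; encoding-related bits are dropped.
  flags = re.options & (Regexp::IGNORECASE | Regexp::EXTENDED | Regexp::MULTILINE)
  # String#encode raises if the pattern cannot be converted to the target.
  Regexp.new(re.source.encode(target), flags)
end

# Stand-in for the ISO-8859-1 regexp literal in test2a.rb.
latin1_re = Regexp.new("[äöüÄÖÜß]".encode(Encoding::ISO_8859_1))
utf8_re   = transcode_regexp(latin1_re, Encoding::UTF_8)

p utf8_re.encoding
["foo\n", "weiß\n", "Bär\n", "bar\n"].each do |line|
  p line, utf8_re =~ line
end
```

With the regexp converted once, every UTF-8 input line can be matched without an encoding error.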

[1] http://groups.google.com/group/ruby-core-google/browse_frm/thread/d6474429dd112926?hl=en
=end
