Bug #5855
Closed: inconsistent treatment of 8 bit characters in US-ASCII
Description
=begin
Does Ruby allow 8-bit characters (128-255) in a US-ASCII encoded string, or not?
"\u{80}".encode("US-ASCII") #=> Encoding::UndefinedConversionError
0x80.chr("US-ASCII") #=> "\x80" (US-ASCII encoding)
"".encode("US-ASCII") << 128 #=> "\x80" (US-ASCII encoding)
"".encode("US-ASCII") << 128.chr #=> "\x80" (ASCII-8BIT encoding)
=end
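For reference, a small reproduction sketch that checks each of the four cases (output as reported above for 1.9.3-era Ruby; later versions behave differently after the fix discussed below):

begin
  "\u{80}".encode("US-ASCII")
rescue Encoding::UndefinedConversionError => e
  p e.class              #=> Encoding::UndefinedConversionError
end

a = 0x80.chr("US-ASCII")
p a.encoding             #=> #<Encoding:US-ASCII>
p a.valid_encoding?      #=> false

b = "".encode("US-ASCII") << 128
p b.encoding             #=> #<Encoding:US-ASCII>

c = "".encode("US-ASCII") << 128.chr
p c.encoding             #=> #<Encoding:ASCII-8BIT>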
Updated by naruse (Yui NARUSE) almost 13 years ago
- Status changed from Open to Rejected
U+0080 of Unicode can't be mapped to 0x80 of US-ASCII.
In US-ASCII, the codepoint 0x80 exists, but doesn't define any character.
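As an aside, the conversion error itself reports which character has no US-ASCII mapping; a quick sketch (exact inspect output may vary by version):

begin
  "\u{80}".encode("US-ASCII")
rescue Encoding::UndefinedConversionError => e
  p e.error_char            #=> "\u0080"
  p e.destination_encoding  #=> #<Encoding:US-ASCII>
end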
Updated by john_firebaugh (John Firebaugh) almost 13 years ago
Unless MRI has some non-standard definition of the term "codepoint", your second statement is incorrect. In US-ASCII, the codepoint 0x80 does not exist.
IMO, any operation that attempts to produce a US-ASCII string containing 0x80 should either fail (like "\u{80}".encode("US-ASCII")) or promote to ASCII-8BIT (like "".encode("US-ASCII") << 128.chr). So I believe the middle two examples are incorrect.
Updated by naruse (Yui NARUSE) almost 13 years ago
- Tracker changed from Bug to Feature
- Status changed from Rejected to Assigned
- Assignee set to naruse (Yui NARUSE)
> IMO, any operation that attempts to produce a US-ASCII string containing 0x80 should either fail (like "\u{80}".encode("US-ASCII")) or
> promote to ASCII-8BIT (like "".encode("US-ASCII") << 128.chr). So I believe the middle two examples are incorrect.
In other words,
> "\u{80}".encode("US-ASCII") #=> Encoding::UndefinedConversionError
For example, US-ASCII clearly doesn't include \u00A3 (Pound Sign),
so it must raise Encoding::UndefinedConversionError.
> 0x80.chr("US-ASCII") #=> "\x80" (US-ASCII encoding)
In Ruby, a string is a sequence of 8-bit bytes, so a US-ASCII string, though
conceptually 7-bit, is still stored as 8-bit bytes. A byte like 0x80 can
therefore be present even though it makes the string invalid.
> "".encode("US-ASCII") << 128 #=> "\x80" (US-ASCII encoding)
> "".encode("US-ASCII") << 128.chr #=> "\x80" (ASCII-8BIT encoding)
Maybe both of them should be ASCII-8BIT.
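To illustrate the byte-string point: on Rubies before this change, the chr call yields a one-byte string that is tagged US-ASCII but not valid in that encoding (a quick sketch; hypothetical session):

s = 0x80.chr("US-ASCII")   # pre-fix behavior: no error is raised
p s.bytes.to_a             #=> [128]
p s.encoding               #=> #<Encoding:US-ASCII>
p s.valid_encoding?        #=> false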
Updated by john_firebaugh (John Firebaugh) almost 13 years ago
=begin
> Maybe both of them should be ASCII-8BIT.
I would prefer not, as then String#<< with an Integer ((|i|)) can't be defined as (({self << i.chr(self.encoding)})).
I think it would make much more sense for (({"".encode("US-ASCII") << 128})) and (({128.chr("US-ASCII")})) both to raise RangeError. The current behavior is just weird:
a = "".encode("US-ASCII") << 128
b = 128.chr("US-ASCII")
a == b #=> true
a.valid_encoding? #=> true
b.valid_encoding? #=> false
=end
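For illustration, a minimal sketch of the definition John refers to, using a hypothetical helper (append_codepoint is not a real method). If Integer#chr raises RangeError for codepoints outside the string's encoding, the Integer form of << stays consistent with the String form:

# Hypothetical helper: "str << i" defined as "str << i.chr(str.encoding)".
def append_codepoint(str, i)
  str << i.chr(str.encoding)
end

append_codepoint("".encode("US-ASCII"), 0x41)  #=> "A"
append_codepoint("".encode("US-ASCII"), 0x80)  # RangeError under the proposed behavior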
Updated by naruse (Yui NARUSE) almost 13 years ago
- Tracker changed from Feature to Bug
Updated by naruse (Yui NARUSE) almost 13 years ago
- Status changed from Assigned to Closed
- % Done changed from 0 to 100
This issue was solved with changeset r34236.
John, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.
* numeric.c (rb_enc_uint_char): raise RangeError when added codepoint is invalid. [Feature #5855] [Bug #5863] [Bug #5864]
* string.c (rb_str_concat): ditto.
* string.c (rb_str_concat): set encoding as ASCII-8BIT when the string is US-ASCII and the argument is an integer greater than 127.
* regenc.c (onigenc_mb2_code_to_mbclen): rearrange error code.
* enc/euc_jp.c (code_to_mbclen): ditto.
* enc/shift_jis.c (code_to_mbclen): ditto.
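For reference, the behavior the changeset describes, as I understand it on post-r34236 Rubies (exact error messages may differ):

0x80.chr("US-ASCII")             # RangeError (invalid codepoint in US-ASCII)
s = "".encode("US-ASCII") << 128
p s                              #=> "\x80"
p s.encoding                     #=> #<Encoding:ASCII-8BIT>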