Bug #5855

closed

inconsistent treatment of 8 bit characters in US-ASCII

Added by john_firebaugh (John Firebaugh) almost 13 years ago. Updated almost 13 years ago.

Status:
Closed
Target version:
-
ruby -v:
Backport:
[ruby-core:41949]

Description

=begin
Does Ruby allow 8-bit characters (128-255) in a US-ASCII encoded string, or not?

"\u{80}".encode("US-ASCII") #=> Encoding::UndefinedConversionError
0x80.chr("US-ASCII") #=> "\x80" (US-ASCII encoding)
"".encode("US-ASCII") << 128 #=> "\x80" (US-ASCII encoding)
"".encode("US-ASCII") << 128.chr #=> "\x80" (ASCII-8BIT encoding)
=end


Related issues 2 (0 open, 2 closed)

Related to Ruby master - Bug #5863: Integer#chr may return a string with multiple characters (Closed, 01/08/2012)
Related to Ruby master - Bug #5864: Integer#chr raises on some invalid codepoints but returns an invalidly-encoded string for others (Closed, 01/08/2012)

Updated by naruse (Yui NARUSE) almost 13 years ago

  • Status changed from Open to Rejected

U+0080 of Unicode can't be mapped to 0x80 of US-ASCII.
In US-ASCII, the codepoint 0x80 exists, but doesn't define any character.
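The distinction drawn here, between converting U+0080 into US-ASCII and merely holding the byte 0x80 in a string labeled US-ASCII, can be sketched roughly as follows. The helper names are hypothetical and exist only for this illustration.

```ruby
# Hypothetical helper: try to transcode U+0080 to US-ASCII and report the outcome.
def transcode_u0080
  "\u{80}".encode("US-ASCII")
rescue Encoding::UndefinedConversionError => e
  e.class
end

# Hypothetical helper: relabel the raw byte 0x80 as US-ASCII without converting it.
def relabel_byte_0x80
  "\x80".b.force_encoding("US-ASCII")
end

p transcode_u0080                 # transcoding has no target character to map to
p "\u{80}".bytes                  # U+0080 is two bytes in UTF-8: [194, 128]

relabeled = relabel_byte_0x80
p relabeled.bytes                 # the single byte 128 is stored as-is
p relabeled.valid_encoding?       # false: the byte is there, but it is no US-ASCII character
```

Transcoding has to find a target character and fails, while relabeling never looks at the content, which is why the byte can sit in the string while the string is still reported as invalid.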

Updated by john_firebaugh (John Firebaugh) almost 13 years ago

Unless MRI has some non-standard definition of the term "codepoint", your second statement is incorrect. In US-ASCII, the codepoint 0x80 does not exist.

IMO, any operation that attempts to produce a US-ASCII string containing 0x80 should either fail (like "\u{80}".encode("US-ASCII")) or promote to ASCII-8BIT (like "".encode("US-ASCII") << 128.chr). So I believe the middle two examples are incorrect.

Updated by naruse (Yui NARUSE) almost 13 years ago

  • Tracker changed from Bug to Feature
  • Status changed from Rejected to Assigned
  • Assignee set to naruse (Yui NARUSE)

> IMO, any operation that attempts to produce a US-ASCII string containing 0x80 should either fail (like "\u{80}".encode("US-ASCII")) or promote to ASCII-8BIT (like "".encode("US-ASCII") << 128.chr). So I believe the middle two examples are incorrect.

In other words,

"\u{80}".encode("US-ASCII") #=> Encoding::UndefinedConversionError

For example, US-ASCII clearly doesn't include \u00A3 (Pound Sign).
So this must raise Encoding::UndefinedConversionError.

0x80.chr("US-ASCII") #=> "\x80" (US-ASCII encoding)

In Ruby, a string is a sequence of 8-bit bytes.
So a US-ASCII string, although US-ASCII is a 7-bit encoding, is stored as an 8-bit byte string.
The byte 0x80 can therefore appear in it, even though the resulting string is invalid.
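To illustrate the point that the encoding is a label over ordinary 8-bit bytes, here is a minimal sketch. `describe` is a hypothetical helper, not part of Ruby; it packs the given bytes and reports how the same storage looks under different labels.

```ruby
# Hypothetical helper: build a string from raw bytes, label it with the given
# encoding, and report what Ruby says about it.
def describe(bytes, encoding)
  s = bytes.pack("C*").force_encoding(encoding)
  {
    bytes:      s.bytes,
    encoding:   s.encoding,
    valid:      s.valid_encoding?,
    ascii_only: s.ascii_only?
  }
end

p describe([0x41], "US-ASCII")     # a 7-bit byte: valid under the US-ASCII label
p describe([0x80], "US-ASCII")     # same kind of storage, but flagged as invalid
p describe([0x80], "ASCII-8BIT")   # the same byte is valid as binary data
```

The stored byte is identical in the last two calls; only the label, and with it the validity check, changes.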

"".encode("US-ASCII") << 128 #=> "\x80" (US-ASCII encoding)
"".encode("US-ASCII") << 128.chr #=> "\x80" (ASCII-8BIT encoding)

Maybe both of them should be ASCII-8BIT.

Updated by john_firebaugh (John Firebaugh) almost 13 years ago

=begin

> Maybe both of them should be ASCII-8BIT.

I would prefer not, as then String#<< with an Integer ((|i|)) can't be defined as (({self << i.chr(self.encoding)})).

I think it would make much more sense for (({"".encode("US-ASCII") << 128})) and (({128.chr("US-ASCII")})) both to raise RangeError. The current behavior is just weird:

a = "".encode("US-ASCII") << 128
b = 128.chr("US-ASCII")
a == b #=> true
a.valid_encoding? #=> true
b.valid_encoding? #=> false

=end
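The identity proposed above, that appending an Integer should behave like appending `i.chr(self.encoding)`, can be checked for codepoints that are valid in the receiver's encoding with a sketch like this one. `concat_matches_chr?` is a hypothetical helper, and the sketch deliberately exercises only valid codepoints; the contested case (128 in US-ASCII) is exactly where the two sides may diverge or raise, depending on the Ruby version.

```ruby
# Hypothetical check: does `s << i` agree with `s << i.chr(s.encoding)`
# in both content and resulting encoding?
def concat_matches_chr?(encoding, codepoint)
  via_integer = String.new(encoding: encoding) << codepoint
  via_chr     = String.new(encoding: encoding) << codepoint.chr(encoding)
  via_integer == via_chr && via_integer.encoding == via_chr.encoding
end

p concat_matches_chr?("US-ASCII", 0x41)    # "A" is valid in US-ASCII
p concat_matches_chr?("UTF-8", 0x20AC)     # U+20AC is valid in UTF-8
```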

Updated by naruse (Yui NARUSE) almost 13 years ago

  • Tracker changed from Feature to Bug

Updated by naruse (Yui NARUSE) almost 13 years ago

  • Status changed from Assigned to Closed
  • % Done changed from 0 to 100

This issue was solved with changeset r34236.
John, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


  • numeric.c (rb_enc_uint_char): raise RangeError when added codepoint
    is invalid. [Feature #5855] [Bug #5863] [Bug #5864]

  • string.c (rb_str_concat): ditto.

  • string.c (rb_str_concat): set encoding as ASCII-8BIT when the string
    is US-ASCII and the argument is an integer greater than 127.

  • regenc.c (onigenc_mb2_code_to_mbclen): rearrange error code.

  • enc/euc_jp.c (code_to_mbclen): ditto.

  • enc/shift_jis.c (code_to_mbclen): ditto.
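As a rough way to observe the two user-visible changes listed above (RangeError for an invalid codepoint, and promotion to ASCII-8BIT when an integer greater than 127 is appended to a US-ASCII string), the following sketch may help. It assumes a Ruby build that includes this changeset; the helper names are hypothetical.

```ruby
# Hypothetical helper: return the character, or the exception class if
# Integer#chr rejects the codepoint for this encoding.
def chr_outcome(codepoint, encoding)
  codepoint.chr(encoding)
rescue RangeError => e
  e.class
end

# Hypothetical helper: append an integer to a fresh, empty string in the given encoding.
def append_integer(encoding, codepoint)
  String.new(encoding: encoding) << codepoint
end

p chr_outcome(0x41, "US-ASCII")           # a valid codepoint yields a one-character string
p chr_outcome(128, "US-ASCII")            # assumed post-fix: rejected with RangeError

promoted = append_integer("US-ASCII", 128)
p promoted.encoding                       # assumed post-fix: no longer US-ASCII
p promoted.valid_encoding?
p append_integer("US-ASCII", 0x41).encoding
```

On a build without the change, the second call would instead return an invalidly encoded string, as described earlier in this thread.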
