Bug #5855
Closed: inconsistent treatment of 8 bit characters in US-ASCII
Description
=begin
Does Ruby allow 8-bit characters (128-255) in a US-ASCII encoded string, or not?
"\u{80}".encode("US-ASCII") #=> Encoding::UndefinedConversionError
0x80.chr("US-ASCII") #=> "\x80" (US-ASCII encoding)
"".encode("US-ASCII") << 128 #=> "\x80" (US-ASCII encoding)
"".encode("US-ASCII") << 128.chr #=> "\x80" (ASCII-8BIT encoding)
=end
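For reference, a small reproduction sketch that checks each of the four cases (output as reported above for 1.9.3-era Ruby; later versions behave differently after the fix discussed below):

begin
  "\u{80}".encode("US-ASCII")
rescue Encoding::UndefinedConversionError => e
  p e.class              #=> Encoding::UndefinedConversionError
end

a = 0x80.chr("US-ASCII")
p a.encoding             #=> #<Encoding:US-ASCII>
p a.valid_encoding?      #=> false

b = "".encode("US-ASCII") << 128
p b.encoding             #=> #<Encoding:US-ASCII>

c = "".encode("US-ASCII") << 128.chr
p c.encoding             #=> #<Encoding:ASCII-8BIT>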
Updated by naruse (Yui NARUSE) almost 13 years ago
- Status changed from Open to Rejected
U+0080 of Unicode can't be mapped to 0x80 of US-ASCII.
In US-ASCII, the codepoint 0x80 exists, but doesn't define any character.
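As an aside, the conversion error itself reports which character has no US-ASCII mapping; a quick sketch (exact inspect output may vary by version):

begin
  "\u{80}".encode("US-ASCII")
rescue Encoding::UndefinedConversionError => e
  p e.error_char            #=> "\u0080"
  p e.destination_encoding  #=> #<Encoding:US-ASCII>
end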
Updated by john_firebaugh (John Firebaugh) almost 13 years ago
Unless MRI has some non-standard definition of the term "codepoint", your second statement is incorrect. In US-ASCII, the codepoint 0x80 does not exist.
IMO, any operation that attempts to produce a US-ASCII string containing 0x80 should either fail (like "\u{80}".encode("US-ASCII")) or promote to ASCII-8BIT (like "".encode("US-ASCII") << 128.chr). So I believe the middle two examples are incorrect.
Updated by naruse (Yui NARUSE) almost 13 years ago
- Tracker changed from Bug to Feature
- Status changed from Rejected to Assigned
- Assignee set to naruse (Yui NARUSE)
> IMO, any operation that attempts to produce a US-ASCII string containing 0x80 should either fail (like "\u{80}".encode("US-ASCII")) or
> promote to ASCII-8BIT (like "".encode("US-ASCII") << 128.chr). So I believe the middle two examples are incorrect.
In other words,
> "\u{80}".encode("US-ASCII") #=> Encoding::UndefinedConversionError
For example, US-ASCII clearly doesn't include \u00A3 (Pound Sign),
so it must raise Encoding::UndefinedConversionError.
> 0x80.chr("US-ASCII") #=> "\x80" (US-ASCII encoding)
In Ruby, a string is a sequence of 8-bit bytes, so a US-ASCII string, though
conceptually 7-bit, is still stored as 8-bit bytes. A byte like 0x80 can
therefore be present even though it makes the string invalid.
> "".encode("US-ASCII") << 128 #=> "\x80" (US-ASCII encoding)
> "".encode("US-ASCII") << 128.chr #=> "\x80" (ASCII-8BIT encoding)
Maybe both of them should be ASCII-8BIT.
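To illustrate the byte-string point: on Rubies before this change, the chr call yields a one-byte string that is tagged US-ASCII but not valid in that encoding (a quick sketch; hypothetical session):

s = 0x80.chr("US-ASCII")   # pre-fix behavior: no error is raised
p s.bytes.to_a             #=> [128]
p s.encoding               #=> #<Encoding:US-ASCII>
p s.valid_encoding?        #=> false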
Updated by john_firebaugh (John Firebaugh) almost 13 years ago
=begin
> Maybe both of them should be ASCII-8BIT.
I would prefer not, as then String#<< with an Integer ((|i|)) can't be defined as (({self << i.chr(self.encoding)})).
I think it would make much more sense for (({"".encode("US-ASCII") << 128})) and (({128.chr("US-ASCII")})) both to raise RangeError. The current behavior is just weird:
a = "".encode("US-ASCII") << 128
b = 128.chr("US-ASCII")
a == b #=> true
a.valid_encoding? #=> true
b.valid_encoding? #=> false
=end
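For illustration, a minimal sketch of the definition John refers to, using a hypothetical helper (append_codepoint is not a real method). If Integer#chr raises RangeError for codepoints outside the string's encoding, the Integer form of << stays consistent with the String form:

# Hypothetical helper: "str << i" defined as "str << i.chr(str.encoding)".
def append_codepoint(str, i)
  str << i.chr(str.encoding)
end

append_codepoint("".encode("US-ASCII"), 0x41)  #=> "A"
append_codepoint("".encode("US-ASCII"), 0x80)  # RangeError under the proposed behavior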
Updated by naruse (Yui NARUSE) almost 13 years ago
- Tracker changed from Feature to Bug
Updated by naruse (Yui NARUSE) almost 13 years ago
- Status changed from Assigned to Closed
- % Done changed from 0 to 100
This issue was solved with changeset r34236.
John, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.
* numeric.c (rb_enc_uint_char): raise RangeError when added codepoint is invalid. [Feature #5855] [Bug #5863] [Bug #5864]
* string.c (rb_str_concat): ditto.
* string.c (rb_str_concat): set encoding as ASCII-8BIT when the string is US-ASCII and the argument is an integer greater than 127.
* regenc.c (onigenc_mb2_code_to_mbclen): rearrange error code.
* enc/euc_jp.c (code_to_mbclen): ditto.
* enc/shift_jis.c (code_to_mbclen): ditto.
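For reference, the behavior the changeset describes, as I understand it on post-r34236 Rubies (exact error messages may differ):

0x80.chr("US-ASCII")             # RangeError (invalid codepoint in US-ASCII)
s = "".encode("US-ASCII") << 128
p s                              #=> "\x80"
p s.encoding                     #=> #<Encoding:ASCII-8BIT>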