Bug #8630: Transcoding high-bit bytes from ASCII-8BIT to a text encoding should be :invalid, not :undef - Ruby - Ruby Issue Tracking System


Added by headius (Charles Nutter) over 12 years ago. Updated over 12 years ago.

Status: Rejected
Assignee:
Target version:
ruby -v: 2.0.0
Backport: 1.9.3: UNKNOWN, 2.0.0: UNKNOWN

[ruby-core:55984]

Description

When transcoding from ASCII-8BIT (BINARY) to a text encoding (e.g. UTF-8), MRI will raise an error for high-bit bytes:

"\xC3".encode("utf-8", "binary") # => Encoding::UndefinedConversionError

This can be suppressed by passing :undef => :replace as an option to the encode call.

I believe that "undef" is the wrong treatment for this error. Undef means that the input character has no representation in the target encoding. In this case, the error is raised because only bytes in the US-ASCII range are valid for transcoding, so the transcoding of high-bit bytes is by definition invalid, not undefined. In other words, high-bit bytes in ASCII-8BIT/BINARY are invalid as characters.

The error raised should be InvalidByteSequenceError, and it should be preventable with the :invalid => :replace option.
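
The behavior described in this report can be reproduced with a short script. This is a minimal sketch, not part of the original report; `transcode_binary` is a hypothetical helper introduced only for illustration, and it returns the exception class instead of raising so the cases can be compared side by side.

```ruby
# Hypothetical helper (not part of Ruby): transcode BINARY bytes to UTF-8 and
# return either the converted string or the class of the error that was raised.
def transcode_binary(bytes, **opts)
  bytes.b.encode("UTF-8", "BINARY", **opts)
rescue EncodingError => e
  e.class
end

# Bytes in the US-ASCII range convert without error.
p transcode_binary("abc")

# A high-bit byte is reported as an undefined conversion.
p transcode_binary("\xC3")

# :undef => :replace substitutes the replacement character (U+FFFD for UTF-8).
p transcode_binary("\xC3", undef: :replace)

# :invalid => :replace does not apply here, so the same error is still reported.
p transcode_binary("\xC3", invalid: :replace)
```

The last call is the case the report objects to: the option that handles invalid byte sequences has no effect on high-bit bytes coming from ASCII-8BIT.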

Updated by akr (Akira Tanaka) over 12 years ago
#1 [ruby-core:55986]

2013/7/13 headius (Charles Nutter) headius@headius.com:

> Bug #8630: Transcoding high-bit bytes from ASCII-8BIT to a text encoding should be :invalid, not :undef
> https://bugs.ruby-lang.org/issues/8630
>
> When transcoding from ASCII-8BIT (BINARY) to a text encoding (e.g. UTF-8), MRI will raise an error for high-bit bytes:
>
> "\xC3".encode("utf-8", "binary") # => Encoding::UndefinedConversionError
>
> This can be suppressed by passing :undef => :replace as an option to the encode call.
>
> I believe that "undef" is the wrong treatment for this error. Undef means that the input character has no representation in the target encoding. In this case, the error is raised because only bytes in the US-ASCII range are valid for transcoding, so the transcoding of high-bit bytes is by definition invalid, not undefined. In other words, high-bit bytes in ASCII-8BIT/BINARY are invalid as characters.

No.

ASCII-8BIT consists of 128 ASCII characters and 128 special characters to
represent 0x80 to 0xff binary bytes.
The special characters are not representable in UTF-8.
So UndefinedConversionError is raised.

The validity of a character is defined by encoding, not transcoding.

Tanaka Akira
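
The distinction drawn in this reply, that validity belongs to the source encoding while "undefined" describes a missing mapping in the target, can be checked directly. This is a minimal sketch; `encode_error` is a hypothetical helper introduced for illustration, and the choice of UTF-16LE and ISO-8859-1 as targets is arbitrary.

```ruby
# Hypothetical helper (not part of Ruby): return the class of the error raised
# by String#encode, or nil when the conversion succeeds.
def encode_error(str, target, **opts)
  str.encode(target, **opts)
  nil
rescue EncodingError => e
  e.class
end

binary  = "\xFF".b                          # one byte tagged as ASCII-8BIT
as_utf8 = "\xFF".b.force_encoding("UTF-8")  # the same byte tagged as UTF-8

# Validity is a property of the string in its own encoding.
p binary.valid_encoding?    # any byte is acceptable in ASCII-8BIT
p as_utf8.valid_encoding?   # 0xFF can never appear in well-formed UTF-8

# Valid source, but no mapping in the target: undefined conversion.
p encode_error(binary, "UTF-8")

# Source that is malformed in its own encoding: invalid byte sequence.
p encode_error(as_utf8, "UTF-16LE")

# A valid UTF-8 character with no counterpart in ISO-8859-1 is also undefined.
p encode_error("\u3042", "ISO-8859-1")
```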

Updated by nobu (Nobuyoshi Nakada) over 12 years ago
#2 [ruby-core:55992]

Status changed from Open to Rejected

Updated by duerst (Martin Dürst) over 12 years ago
#3 [ruby-core:55994]

Hello Charles,

On 2013/07/13 6:26, Tanaka Akira wrote:

> 2013/7/13 headius (Charles Nutter) headius@headius.com:
>
>> Bug #8630: Transcoding high-bit bytes from ASCII-8BIT to a text encoding should be :invalid, not :undef
>> https://bugs.ruby-lang.org/issues/8630
>>
>> When transcoding from ASCII-8BIT (BINARY) to a text encoding (e.g. UTF-8), MRI will raise an error for high-bit bytes:
>>
>> "\xC3".encode("utf-8", "binary") # => Encoding::UndefinedConversionError
>>
>> I believe that "undef" is the wrong treatment for this error. Undef means that the input character has no representation in the target encoding. In this case, the error is raised because only bytes in the US-ASCII range are valid for transcoding, so the transcoding of high-bit bytes is by definition invalid, not undefined. In other words, high-bit bytes in ASCII-8BIT/BINARY are invalid as characters.
>
> No.

I fully agree.

> ASCII-8BIT consists of 128 ASCII characters and 128 special characters to
> represent 0x80 to 0xff binary bytes.

That's one way to put it, but a better way is to say that ASCII-8BIT
consists of 128 ASCII characters and 128 unassigned codepoints. This is
similar to unassigned codepoints in UTF-8.

> The special characters are not representable in UTF-8.
> So UndefinedConversionError is raised.
>
> The validity of a character is defined by encoding, not transcoding.

Yes. Valid means that the original data, as is, is valid, nothing more. It
does not depend on the target encoding. And ASCII-8BIT of course can
contain bytes 0x80 and beyond; that's its job.

Regards, Martin.
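
One way to see that the outcome depends on the source encoding and not on the target is to tag the same bytes as US-ASCII and as ASCII-8BIT before transcoding to the same target. This is a minimal sketch; `to_utf8` is a hypothetical helper introduced for illustration, and the sample bytes are arbitrary.

```ruby
# Hypothetical helper (not part of Ruby): tag raw bytes with a source encoding,
# transcode to UTF-8, and return the result or the class of the error raised.
def to_utf8(bytes, source, **opts)
  bytes.b.force_encoding(source).encode("UTF-8", **opts)
rescue EncodingError => e
  e.class
end

raw = "a\xE9b"  # ASCII letters around one high-bit byte

# In US-ASCII a high-bit byte is malformed input, so :invalid governs it.
p to_utf8(raw, "US-ASCII")
p to_utf8(raw, "US-ASCII", invalid: :replace)

# In ASCII-8BIT the same byte is legitimate data with no UTF-8 mapping,
# so :undef governs it.
p to_utf8(raw, "ASCII-8BIT")
p to_utf8(raw, "ASCII-8BIT", undef: :replace)
```

Under this reading, a caller who wants the :invalid treatment for high-bit bytes can get it by tagging the data as US-ASCII before transcoding.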

Updated by akr (Akira Tanaka) over 12 years ago
#4 [ruby-core:55995]

2013/7/13 "Martin J. Dürst" duerst@it.aoyama.ac.jp:

> That's one way to put it, but a better way is to say that ASCII-8BIT
> consists of 128 ASCII characters and 128 unassigned codepoints. This is
> similar to unassigned codepoints in UTF-8.

Your interpretation forbids us from converting binary data between encodings.

For example, Emacs has charsets for binary such as eight-bit-control or
eight-bit-graphic (or eight-bit? I'm not familiar with recent Emacs).

If we support an encoding that covers both them and ASCII, we can convert
a binary string between that encoding and ASCII-8BIT.

In your interpretation, such a conversion would raise
UndefinedConversionError because unassigned codepoints can't have
a character mapping in another encoding.

Tanaka Akira
