Bug #8630

Transcoding high-bit bytes from ASCII-8BIT to a text encoding should be :invalid, not :undef

Added by Charles Nutter 9 months ago. Updated 9 months ago.

[ruby-core:55984]
Status:Rejected
Priority:Normal
Assignee:-
Category:-
Target version:-
ruby -v:2.0.0 Backport:1.9.3: UNKNOWN, 2.0.0: UNKNOWN

Description

When transcoding from ASCII-8BIT (BINARY) to a text encoding (e.g. UTF-8), MRI will raise an error for high-bit bytes:

"\xC3".encode("utf-8", "binary") # => Encoding::UndefinedConversionError

This can be disabled by passing :undef => :replace as an option to the encode call.

I believe that "undef" is the wrong treatment for this error. Undef means that the input character has no representation in the target encoding. In this case, the error is raised because only bytes in the US-ASCII range are valid for transcoding, so the transcoding of high-bit bytes is by definition invalid, not undefined. In other words, high-bit bytes in ASCII-8BIT/BINARY are invalid as characters.

The error raised should be InvalidByteSequenceError, and it should be preventable with the :invalid => :replace option.
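
The behavior described above can be checked with a short script. This is only a sketch of what current MRI does: the variable names are illustrative, and `String#b` (available since Ruby 2.0) is used so the string is tagged ASCII-8BIT regardless of the source file encoding.

```ruby
# A single high-bit byte tagged as ASCII-8BIT (BINARY).
bytes = "\xC3".b

# Without options, the conversion raises UndefinedConversionError.
default_error =
  begin
    bytes.encode("UTF-8", "BINARY")
    nil
  rescue EncodingError => e
    e.class
  end

# :undef => :replace substitutes the replacement character (U+FFFD for UTF-8).
replaced = bytes.encode("UTF-8", "BINARY", :undef => :replace)

# :invalid => :replace on its own does not suppress the error,
# because MRI classifies the byte as undefined rather than invalid.
invalid_error =
  begin
    bytes.encode("UTF-8", "BINARY", :invalid => :replace)
    nil
  rescue EncodingError => e
    e.class
  end

p default_error  # => Encoding::UndefinedConversionError
p replaced       # => "\uFFFD"
p invalid_error  # => Encoding::UndefinedConversionError
```

The third case is the one this report asks to change: under the proposal, :invalid => :replace would be the option that suppresses the error.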

History

#1 Updated by Akira Tanaka 9 months ago

2013/7/13 headius (Charles Nutter) <headius@headius.com>:

> Bug #8630: Transcoding high-bit bytes from ASCII-8BIT to a text encoding should be :invalid, not :undef
> https://bugs.ruby-lang.org/issues/8630
>
> When transcoding from ASCII-8BIT (BINARY) to a text encoding (e.g. UTF-8), MRI will raise an error for high-bit bytes:
>
> "\xC3".encode("utf-8", "binary") # => Encoding::UndefinedConversionError
>
> This can be disabled by passing :undef => :replace as an option to the encode call.
>
> I believe that "undef" is the wrong treatment for this error. Undef means that the input character has no representation in the target encoding. In this case, the error is raised because only bytes in the US-ASCII range are valid for transcoding, so the transcoding of high-bit bytes is by definition invalid, not undefined. In other words, high-bit bytes in ASCII-8BIT/BINARY are invalid as characters.

No.

ASCII-8BIT consists of 128 ASCII characters and 128 special characters that
represent the binary bytes 0x80 to 0xff.
The special characters are not representable in UTF-8.
So UndefinedConversionError is raised.

The validity of a character is defined by the encoding, not by transcoding.
--
Tanaka Akira
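
The distinction drawn here can be observed directly. The sketch below compares the same byte tagged as ASCII-8BIT and as US-ASCII; the helper `conversion_error` is a hypothetical name introduced only for this illustration.

```ruby
# The same byte under two different source encodings.
binary = "\xC3".b                                 # ASCII-8BIT
ascii  = "\xC3".dup.force_encoding("US-ASCII")    # US-ASCII

# Validity is a property of the string in its own encoding.
binary_valid = binary.valid_encoding?   # high-bit bytes are legal in ASCII-8BIT
ascii_valid  = ascii.valid_encoding?    # but not in US-ASCII

# Hypothetical helper: returns the exception class raised by encode, or nil.
def conversion_error(str, target)
  str.encode(target)
  nil
rescue EncodingError => e
  e.class
end

binary_error = conversion_error(binary, "UTF-8")
ascii_error  = conversion_error(ascii, "UTF-8")

p binary_valid  # => true
p ascii_valid   # => false
p binary_error  # => Encoding::UndefinedConversionError
p ascii_error   # => Encoding::InvalidByteSequenceError
```

A string that is already invalid in its source encoding yields InvalidByteSequenceError, while a valid ASCII-8BIT byte with no UTF-8 counterpart yields UndefinedConversionError.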

#2 Updated by Nobuyoshi Nakada 9 months ago

  • Status changed from Open to Rejected

#3 Updated by Martin Dürst 9 months ago

Hello Charles,

On 2013/07/13 6:26, Tanaka Akira wrote:

> 2013/7/13 headius (Charles Nutter) <headius@headius.com>:
>
> > Bug #8630: Transcoding high-bit bytes from ASCII-8BIT to a text encoding should be :invalid, not :undef
> > https://bugs.ruby-lang.org/issues/8630
> >
> > When transcoding from ASCII-8BIT (BINARY) to a text encoding (e.g. UTF-8), MRI will raise an error for high-bit bytes:
> >
> > "\xC3".encode("utf-8", "binary") # => Encoding::UndefinedConversionError
> >
> > I believe that "undef" is the wrong treatment for this error. Undef means that the input character has no representation in the target encoding. In this case, the error is raised because only bytes in the US-ASCII range are valid for transcoding, so the transcoding of high-bit bytes is by definition invalid, not undefined. In other words, high-bit bytes in ASCII-8BIT/BINARY are invalid as characters.
>
> No.

I fully agree.

> ASCII-8BIT consists of 128 ASCII characters and 128 special characters that
> represent the binary bytes 0x80 to 0xff.

That's one way to put it, but a better way is to say that ASCII-8BIT
consists of 128 ASCII characters and 128 unassigned codepoints. This is
similar to unassigned codepoints in UTF-8.

> The special characters are not representable in UTF-8.
> So UndefinedConversionError is raised.
>
> The validity of a character is defined by the encoding, not by transcoding.

Yes. Valid means that the original data, as is, is valid, nothing more. It
does not depend on the target encoding. And ASCII-8BIT of course can
contain bytes 0x80 and beyond, that's its job.

Regards, Martin.

#4 Updated by Akira Tanaka 9 months ago

2013/7/13 "Martin J. Dürst" <duerst@it.aoyama.ac.jp>:

> That's one way to put it, but a better way is to say that ASCII-8BIT
> consists of 128 ASCII characters and 128 unassigned codepoints. This is
> similar to unassigned codepoints in UTF-8.

Your interpretation forbids us from converting binary data between encodings.

For example, Emacs has charsets for binary such as eight-bit-control or
eight-bit-graphic (or eight-bit? I'm not familiar with recent Emacs).

If we support an encoding which supports them and ASCII, we can convert
a binary string between that encoding and ASCII-8BIT.

In your interpretation, such conversion would raise
UndefinedConversionError because unassigned codepoints can't have
character mapping for another encoding.
--
Tanaka Akira
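
As a rough illustration of treating each high-bit byte as a character that can be mapped elsewhere, the :fallback option of String#encode can be used, since it is consulted for undefined conversions. This is only a sketch: MRI ships no Emacs-style eight-bit charset, so the "<XX>" escape format and the name `escape_high_byte` are assumptions made up for this example.

```ruby
# Hypothetical mapping: render each high-bit byte as "<XX>" (hex).
# The fallback receives the undefined character as a one-byte ASCII-8BIT string.
escape_high_byte = ->(char) { format("<%02X>", char.ord) }

binary = "a\xC3b".b

# ASCII bytes pass through unchanged; the high-bit byte goes through the fallback.
escaped = binary.encode("UTF-8", "BINARY", :fallback => escape_high_byte)

p escaped           # => "a<C3>b"
p escaped.encoding  # => #<Encoding:UTF-8>
```

The fallback is invoked precisely because the byte is reported as undefined rather than invalid, which fits the view that 0x80 to 0xff are characters of ASCII-8BIT that merely lack a counterpart in UTF-8.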
