Bug #8630

closed

Transcoding high-bit bytes from ASCII-8BIT to a text encoding should be :invalid, not :undef

Added by headius (Charles Nutter) almost 11 years ago. Updated almost 11 years ago.

Status:
Rejected
Assignee:
-
Target version:
-
ruby -v:
2.0.0
[ruby-core:55984]

Description

When transcoding from ASCII-8BIT (BINARY) to a text encoding (e.g. UTF-8), MRI will raise an error for high-bit bytes:

"\xC3".encode("utf-8", "binary") # => Encoding::UndefinedConversionError

The error can be suppressed by passing :undef => :replace as an option to the encode call.
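A minimal sketch of the behavior described above, as I understand current MRI. The variable names are only for illustration, and `String#b` assumes Ruby 2.0 or later:

```ruby
# A single high-bit byte, tagged as ASCII-8BIT (BINARY).
bytes = "\xC3".b

# Without options, the conversion raises; capture the error class.
error_class = begin
  bytes.encode("UTF-8", "BINARY")
  nil
rescue EncodingError => e
  e.class
end

# With :undef => :replace, the byte is substituted instead of raising.
# For a UTF-8 target the default replacement is U+FFFD.
replaced = bytes.encode("UTF-8", "BINARY", :undef => :replace)

p error_class
p replaced
```

Here `error_class` should be Encoding::UndefinedConversionError, and `replaced` should be the single replacement character U+FFFD.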

I believe that "undef" is the wrong treatment for this error. Undef means that the input character has no representation in the target encoding. In this case, the error is raised because only bytes in the US-ASCII range are valid for transcoding, so the transcoding of high-bit bytes is by definition invalid, not undefined. In other words, high-bit bytes in ASCII-8BIT/BINARY are invalid as characters.

The error raised should be InvalidByteSequenceError, and it should be preventable by using the :invalid => :replace option.
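To illustrate the distinction, here is a rough sketch contrasting the current behavior with a case that MRI already treats as invalid. It assumes current MRI semantics; the truncated UTF-8 string is my own example and is not part of the original report:

```ruby
bytes = "\xC3".b

# Current behavior: :invalid => :replace does not cover the BINARY case,
# so the call still raises UndefinedConversionError.
still_raises = begin
  bytes.encode("UTF-8", "BINARY", :invalid => :replace)
  false
rescue Encoding::UndefinedConversionError
  true
end

# For contrast: the same byte tagged as UTF-8 is a truncated sequence,
# which MRI reports as an invalid byte sequence when transcoding.
truncated = "\xC3".dup.force_encoding("UTF-8")
invalid_error = begin
  truncated.encode("UTF-16LE")
  nil
rescue EncodingError => e
  e.class
end

# Here :invalid => :replace is the option that applies.
repaired = truncated.encode("UTF-16LE", :invalid => :replace)

p still_raises
p invalid_error
p repaired
```

The proposal amounts to making the first case behave like the second: raise Encoding::InvalidByteSequenceError and honor :invalid => :replace.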
