Feature #11094
Closed
Remove traces of 6-byte UTF-8
Description
UTF-8 was originally defined with a codespace of up to 31 bits, and therefore with up to 6 bytes per character. It has long since been reduced in all the relevant definitions (ISO, Unicode, IETF) to a codespace of up to 0x10FFFF and a maximum of 4 bytes per character. Many places in the Ruby code base have been updated to this 4-byte limit (e.g. EncLen_UTF8 in enc/utf_8.c), but others have not (e.g. code_to_mbclen in enc/utf_8.c). This should be fixed.
[I have classified this as a feature because I wasn't able to find a way to expose this problem in Ruby code, but this should be reclassified as a bug if such a problem can be found.]
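To make the 4-byte limit concrete, here is a minimal Ruby sketch of the length rule the description refers to. It is an illustration only, not the C code in enc/utf_8.c: the names `UTF8_MAX_CODEPOINT` and `utf8_byte_length` are hypothetical, and the sketch deliberately ignores the surrogate range.

```ruby
# Hypothetical helper, not part of Ruby or of enc/utf_8.c.
# Returns how many bytes UTF-8 needs for a code point under the current
# definition (codespace up to 0x10FFFF, at most 4 bytes per character).
# Assumption: surrogates (0xD800..0xDFFF) are not treated specially here.
UTF8_MAX_CODEPOINT = 0x10FFFF

def utf8_byte_length(code)
  case code
  when 0x00..0x7F                  then 1
  when 0x80..0x7FF                 then 2
  when 0x800..0xFFFF               then 3
  when 0x10000..UTF8_MAX_CODEPOINT then 4
  else
    # The original 31-bit definition would have continued with 5- and
    # 6-byte forms here; the current definition rejects such values.
    raise RangeError, format("0x%X is outside the UTF-8 codespace", code)
  end
end

# Cross-check against Ruby's own encoder for in-range values.
[0x41, 0xE9, 0x20AC, 0x1F600, 0x10FFFF].each do |cp|
  same = utf8_byte_length(cp) == [cp].pack("U").bytesize
  puts format("U+%04X: %d byte(s), matches pack(\"U\"): %s",
              cp, utf8_byte_length(cp), same)
end
```

The `else` branch is the point of the request: anything above 0x10FFFF is rejected rather than given a 5- or 6-byte length.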
Updated by nobu (Nobuyoshi Nakada) about 10 years ago
And pack("U") and unpack("U")?
Also rubyspec seems to fail.
Array#pack with format 'U' encodes values larger than UTF-8 max codepoints ERROR
RangeError: pack(U): value out of range
Updated by nobu (Nobuyoshi Nakada) about 10 years ago
- Status changed from Open to Closed
- % Done changed from 0 to 100
Applied in changeset r50392.
enc/utf_8.c: limit UTF-8
- enc/utf_8.c (code_to_mbclen, code_to_mbc): reject values larger
than UTF-8 max codepoints. [Feature #11094]
Updated by nobu (Nobuyoshi Nakada) about 8 years ago
- Related to Bug #13353: Backport stringio fixes added
Updated by duerst (Martin Dürst) almost 8 years ago
- Related to Bug #13590: Change max byte length of UTF-8 to 4 bytes to conform to definition of UTF-8 added
Updated by duerst (Martin Dürst) almost 8 years ago
- Related to Feature #13588: Add Encoding#min_char_size, #max_char_size, #minmax_char_size added