Bug #2636
Incorrect UTF-16 string length
| Status: | Closed | Start date: | 01/24/2010 |
|---|---|---|---|
| Priority: | Normal | Due date: | |
| Assignee: | - | % Done: | 0% |
| Category: | M17N | | |
| Target version: | 1.9.2 | | |
| ruby -v: | ruby 1.9.2dev (2010-01-22 trunk 26370) [x86_64-darwin10.2.0] | | |
Description
str = "\xDC\x0B\xD8\x40".force_encoding(Encoding::UTF_16BE)
str.length #=> 3
This string is made by inverting 2 words of a UTF-16 character not in the BMP.
The length should be 2 because it's made of two (unpaired) surrogates and not 3.
The strangest part is that the length agrees with how the string is displayed by #inspect ("\xDC\u0BD8\x40"), but not with what #[] does. If the length is 3, then why does str[2] return nil?
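For readers who want to reproduce the input, here is a minimal sketch of how such a string can be built. The variable names `valid` and `swapped` are illustrative only and do not come from the report. It starts from the well-formed surrogate pair for U+2000B and swaps its two 16-bit words.

```ruby
# Build U+2000B (outside the BMP) as a UTF-16BE surrogate pair.
valid = [0xD8, 0x40, 0xDC, 0x0B].pack("C*").force_encoding(Encoding::UTF_16BE)

# Swap the two 16-bit words to obtain the byte sequence from this report.
swapped = valid.bytes.each_slice(2).to_a.reverse.flatten
               .pack("C*").force_encoding(Encoding::UTF_16BE)

p valid.valid_encoding?   # => true
p valid.length            # => 1
p swapped.bytes           # => [220, 11, 216, 64]
p swapped.valid_encoding? # => false (the low surrogate comes first)
p swapped.length          # the value under discussion in this report
```

The last line is deliberately left without an expected value, since the result depends on whether the interpreter includes the fix discussed in this issue.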
Associated revisions
* string.c (rb_enc_strlen_cr): increment by rb_enc_mbminlen(enc) for
broken byte sequence. [ruby-core:27748]
* string.c (rb_str_inspect): increment by rb_enc_mbminlen(enc) for
broken byte sequence. [ruby-core:27748]
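The idea in these commit messages can be sketched in Ruby. This is not the C code in string.c; `utf16be_length` and its `min_len` parameter are hypothetical names used only to illustrate stepping by the encoding's minimum character length (2 bytes for UTF-16) whenever the bytes at the cursor do not form a valid character.

```ruby
# Hypothetical helper: counts UTF-16BE "characters" in an array of bytes,
# advancing by min_len whenever the bytes at the cursor are broken.
def utf16be_length(bytes, min_len: 2)
  count = 0
  pos = 0
  while pos < bytes.size
    remaining = bytes.size - pos
    lead = bytes[pos]
    step =
      if remaining >= 2 && !(0xD8..0xDF).cover?(lead)
        2 # ordinary BMP code unit
      elsif (0xD8..0xDB).cover?(lead) && remaining >= 4 &&
            (0xDC..0xDF).cover?(bytes[pos + 2])
        4 # well-formed surrogate pair
      else
        [min_len, remaining].min # broken sequence: skip one code unit
      end
    pos += step
    count += 1
  end
  count
end

p utf16be_length([0xDC, 0x0B, 0xD8, 0x40]) # => 2 (two unpaired surrogates)
p utf16be_length([0xD8, 0x40, 0xDC, 0x0B]) # => 1 (one surrogate pair)
```

Stepping by a whole code unit keeps the cursor aligned on 16-bit boundaries, which is why the swapped input counts as two items rather than three.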
History
Updated by naruse (Yui NARUSE) over 2 years ago
- Status changed from Open to Rejected
"\xD8\x40\xDC\x0B".force_encoding(Encoding::UTF_16BE) is correct.
Updated by naruse (Yui NARUSE) over 2 years ago
Or the following will explain this:
> "\xDC\x0b\xD8\x40".force_encoding(Encoding::UTF_16BE)
=> "\xDC\u0BD8\x40"
Updated by duerst (Martin Dürst) over 2 years ago
What needs to be fixed here is the data, nothing else:
irb(main):001:> s = "\xDC\x0B\xD8\x40".force_encoding 'UTF-16BE'
=> "\xDC\x{BD8}\x40"
irb(main):002:> s.valid_encoding?
=> false
returning 2 for s.length may be called "somewhat more correct" than
returning 3, but in both cases, it's basically garbage in, garbage out.
Single (unpaired) surrogates are not characters in UTF-16. The most
correct answer might be "nil", in the sense of "sorry, wrong question".
The only reason #length just returns something, rather than throwing an
error, for the above case, is efficiency.
Regards, Martin.
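The "sorry, wrong question" answer can be approximated in user code. The sketch below assumes a caller who prefers nil over a count for invalid data; `checked_length` is a hypothetical helper, not part of Ruby's String API.

```ruby
# Hypothetical helper: returns the character count, or nil for invalid data.
def checked_length(str)
  str.valid_encoding? ? str.length : nil
end

good = [0xD8, 0x40, 0xDC, 0x0B].pack("C*").force_encoding(Encoding::UTF_16BE)
bad  = [0xDC, 0x0B, 0xD8, 0x40].pack("C*").force_encoding(Encoding::UTF_16BE)

p checked_length(good) # => 1
p checked_length(bad)  # => nil
```

Note that this pays for a validity scan on every call, which is the efficiency cost mentioned above.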
On 2010/01/24 14:36, Tanaka Akira wrote:
> 2010/1/24 Vincent Isambart<redmine@ruby-lang.org>:
>> Bug #2636: Incorrect UTF-16 string length
>> http://redmine.ruby-lang.org/issues/show/2636
>
>> str = "\xDC\x0B\xD8\x40".force_encoding(Encoding::UTF_16BE)
>> str.length #=> 3
>>
>> This string is made by inverting 2 words of a UTF-16 character not in the BMP.
>> The length should be 2 because it's made of two (unpaired) surrogates and not 3.
>
> Fixed.
>
> % ./ruby -ve '
> s = "\xDC\x0B\xD8\x40".force_encoding(Encoding::UTF_16BE)
> p s
> p s.length'
> ruby 1.9.2dev (2010-01-24 trunk 26392) [i686-linux]
> "\xDC\x0B\xD8\x40"
> 2
--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
Updated by scritch (Vincent Isambart) over 2 years ago
> What needs to be fixed here is the data, nothing else:
>
> irb(main):001:> s = "\xDC\x0B\xD8\x40".force_encoding 'UTF-16BE'
> => "\xDC\x{BD8}\x40"
> irb(main):002:> s.valid_encoding?
> => false
Yes I know the data is invalid UTF-16. I created it on purpose (to
test code I'm working on for MacRuby).
My main concern was that what #length and #[] were doing was different.
If s[0], s[1], s[2] would have been returning "\xDC", "\x{BD8}" and
"\x40" it would have been consistent. But s[2] was returning nil even
though s.length was 3.
And after Tanaka Akira's fix, Ruby does exactly what I was expecting.
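The consistency property described here can be written as a small check. `length_matches_index?` is a hypothetical name used for illustration: it asserts that every index below #length yields a character and that the index equal to #length yields nil.

```ruby
# Hypothetical check: #length and #[] agree on how many characters exist.
def length_matches_index?(str)
  (0...str.length).all? { |i| !str[i].nil? } && str[str.length].nil?
end

s = [0xDC, 0x0B, 0xD8, 0x40].pack("C*").force_encoding(Encoding::UTF_16BE)
p s.length
p length_matches_index?(s)
```

On an interpreter with the behavior originally reported (length 3, s[2] nil), this check would return false; with the fix it is expected to return true.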
Updated by naruse (Yui NARUSE) over 2 years ago
- Status changed from Rejected to Closed
> My main concern was that what #length and #[] were doing was different.
> If s[0], s[1], s[2] would have been returning "\xDC", "\x{BD8}" and
> "\x40" it would have been consistent. But s[2] was returning nil even
> though s.length was 3.
Ah, I see. Current behavior seems correct.
Updated by duerst (Martin Dürst) over 2 years ago
On 2010/01/25 16:37, Vincent Isambart wrote:
>> What needs to be fixed here is the data, nothing else:
>>
>> irb(main):001:> s = "\xDC\x0B\xD8\x40".force_encoding 'UTF-16BE'
>> => "\xDC\x{BD8}\x40"
>> irb(main):002:> s.valid_encoding?
>> => false
>
> Yes I know the data is invalid UTF-16. I created it on purpose (to
> test code I'm working on for MacRuby).
>
> My main concern was that what #length and #[] were doing was different.
> If s[0], s[1], s[2] would have been returning "\xDC", "\x{BD8}" and
> "\x40" it would have been consistent. But s[2] was returning nil even
> though s.length was 3.
>
> And after Tanaka Akira's fix, Ruby does exactly what I was expecting.
I don't oppose Akira's fix, but expecting consistent output from
inconsistent input is essentially futile. I sincerely hope nobody will
add this case to a test suite or will claim that this is THE right way
to do things.
Regards, Martin.