Bug #2636

Incorrect UTF-16 string length

Added by scritch (Vincent Isambart) over 2 years ago. Updated about 1 year ago.

[ruby-core:27748]
Status: Closed
Start date: 01/24/2010
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: M17N
Target version: 1.9.2
ruby -v: ruby 1.9.2dev (2010-01-22 trunk 26370) [x86_64-darwin10.2.0]

Description

str = "\xDC\x0B\xD8\x40".force_encoding(Encoding::UTF_16BE)
str.length #=> 3

This string is made by swapping the two 16-bit words of a UTF-16 character that is not in the BMP.
The length should be 2, not 3, because the string consists of two (unpaired) surrogates.

The strangest part is that the length agrees with how the string is displayed by #inspect ("\xDC\u0BD8\x40"), but not with what #[] does. If the length is 3, then why does str[2] return nil?
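
To make the construction concrete, here is a minimal sketch of how the test string arises and how the agreement between #length and #[] can be checked. It assumes a Ruby build that already contains the fix from this ticket (revision 26392 or later); the variable names are illustrative only.

```ruby
# Valid surrogate pair for a character outside the BMP (U+2000B).
valid = [0xD8, 0x40, 0xDC, 0x0B].pack("C*").force_encoding(Encoding::UTF_16BE)
valid.valid_encoding?    # true: high surrogate followed by low surrogate
valid.length             # 1: the pair encodes a single character

# Swap the two 16-bit words to obtain the string from this report.
swapped_bytes = valid.bytes.each_slice(2).to_a.reverse.flatten
swapped = swapped_bytes.pack("C*").force_encoding(Encoding::UTF_16BE)
swapped.valid_encoding?  # false: a low surrogate now comes first

# #length and #[] agree when every index below length yields a
# substring and the index equal to length yields nil.
in_range = (0...swapped.length).map { |i| swapped[i] }
consistent = in_range.none?(&:nil?) && swapped[swapped.length].nil?
```

On the build described in the report (length 3, str[2] nil), `consistent` would come out false, which is the mismatch this ticket is about.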

Associated revisions

Revision 26392
Added by akr (Akira Tanaka) over 2 years ago

* string.c (rb_enc_strlen_cr): increment by rb_enc_mbminlen(enc) for broken byte sequence. [ruby-core:27748]

Revision 26393
Added by akr (Akira Tanaka) over 2 years ago

* string.c (rb_str_inspect): increment by rb_enc_mbminlen(enc) for broken byte sequence. [ruby-core:27748]
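
The rule named in both commit messages can be modelled in a few lines of Ruby. This is only a simplified sketch for UTF-16BE, not the C code in string.c: the method names and the `MIN_LEN` constant are hypothetical, with `MIN_LEN` standing in for the value rb_enc_mbminlen(enc) would report for this encoding.

```ruby
# Hypothetical pure-Ruby model of the counting rule (not the real string.c code).
MIN_LEN = 2 # smallest UTF-16 character size in bytes

def high_surrogate?(byte)
  (0xD8..0xDB).cover?(byte)
end

def low_surrogate?(byte)
  (0xDC..0xDF).cover?(byte)
end

# Counts characters in an array of UTF-16BE bytes.
def utf16be_model_length(bytes)
  pos = 0
  count = 0
  while pos < bytes.size
    if high_surrogate?(bytes[pos]) && pos + 4 <= bytes.size && low_surrogate?(bytes[pos + 2])
      pos += 4       # well-formed surrogate pair: one 4-byte character
    else
      pos += MIN_LEN # BMP character, or a broken sequence skipped by the minimum length
    end
    count += 1
  end
  count
end

utf16be_model_length([0xD8, 0x40, 0xDC, 0x0B]) # valid pair
utf16be_model_length([0xDC, 0x0B, 0xD8, 0x40]) # swapped words from this report
```

Under this model the valid pair counts as 1 and the swapped string as 2, because each unpaired surrogate is stepped over as one 2-byte unit.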

History

Updated by naruse (Yui NARUSE) over 2 years ago

  • Status changed from Open to Rejected
"\xD8\x40\xDC\x0B".force_encoding(Encoding::UTF_16BE) is correct.

Updated by naruse (Yui NARUSE) over 2 years ago

Or the following will explain this:
> "\xDC\x0b\xD8\x40".force_encoding(Encoding::UTF_16BE)
=> "\xDC\u0BD8\x40"

Updated by duerst (Martin Dürst) over 2 years ago

What needs to be fixed here is the data, nothing else:

irb(main):001:> s = "\xDC\x0B\xD8\x40".force_encoding 'UTF-16BE'
=> "\xDC\x{BD8}\x40"
irb(main):002:> s.valid_encoding?
=> false

Returning 2 for s.length may be called "somewhat more correct" than
returning 3, but in both cases, it's basically garbage in, garbage out. 
Single (unpaired) surrogates are not characters in UTF-16. The most 
correct answer might be "nil", in the sense of "sorry, wrong question".

The only reason #length just returns something, rather than throwing an 
error, for the above case, is efficiency.
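
For code that prefers the "sorry, wrong question" answer, one possible sketch is to check validity before asking for the length. The helper name `checked_length` is hypothetical and not part of Ruby; it only combines String#valid_encoding? and String#length.

```ruby
# Hypothetical helper: report a length only for well-formed data.
def checked_length(str)
  str.valid_encoding? ? str.length : nil
end

good = [0xD8, 0x40, 0xDC, 0x0B].pack("C*").force_encoding(Encoding::UTF_16BE)
bad  = [0xDC, 0x0B, 0xD8, 0x40].pack("C*").force_encoding(Encoding::UTF_16BE)

checked_length(good) # 1
checked_length(bad)  # nil, the data is not valid UTF-16
```

The extra validity scan is the cost that #length itself avoids.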

Regards,    Martin.


On 2010/01/24 14:36, Tanaka Akira wrote:
> 2010/1/24 Vincent Isambart<redmine@ruby-lang.org>:
>> Bug #2636: Incorrect UTF-16 string length
>> http://redmine.ruby-lang.org/issues/show/2636
>
>> str = "\xDC\x0B\xD8\x40".force_encoding(Encoding::UTF_16BE)
>> str.length #=>  3
>>
>> This string is made by inverting 2 words of a UTF-16 character not in the BMP.
>> The length should be 2 because it's made of two (unpaired) surrogates and not 3.
>
> Fixed.
>
> % ./ruby -ve '
> s = "\xDC\x0B\xD8\x40".force_encoding(Encoding::UTF_16BE)
> p s
> p s.length'
> ruby 1.9.2dev (2010-01-24 trunk 26392) [i686-linux]
> "\xDC\x0B\xD8\x40"
> 2

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp

Updated by scritch (Vincent Isambart) over 2 years ago

> What needs to be fixed here is the data, nothing else:
>
> irb(main):001:> s = "\xDC\x0B\xD8\x40".force_encoding 'UTF-16BE'
> => "\xDC\x{BD8}\x40"
> irb(main):002:> s.valid_encoding?
> => false

Yes I know the data is invalid UTF-16. I created it on purpose (to
test code I'm working on for MacRuby).

My main concern was that what #length and #[] were doing was different.
If s[0], s[1], s[2] would have been returning "\xDC", "\x{BD8}" and
"\x40" it would have been consistent. But s[2] was returning nil even
though s.length was 3.

And after Tanaka Akira's fix, Ruby does exactly what I was expecting.

Updated by naruse (Yui NARUSE) over 2 years ago

  • Status changed from Rejected to Closed
> My main concern was that what #length and #[] were doing was different.
> If s[0], s[1], s[2] would have been returning "\xDC", "\x{BD8}" and
> "\x40" it would have been consistent. But s[2] was returning nil even
> though s.length was 3.

Ah, I see. Current behavior seems correct.

Updated by duerst (Martin Dürst) over 2 years ago

On 2010/01/25 16:37, Vincent Isambart wrote:
>> What needs to be fixed here is the data, nothing else:
>>
>> irb(main):001:>  s = "\xDC\x0B\xD8\x40".force_encoding 'UTF-16BE'
>> =>  "\xDC\x{BD8}\x40"
>> irb(main):002:>  s.valid_encoding?
>> =>  false
>
> Yes I know the data is invalid UTF-16. I created it on purpose (to
> test code I'm working on for MacRuby).
>
> My main concern was that what #length and #[] were doing was different.
> If s[0], s[1], s[2] would have been returning "\xDC", "\x{BD8}" and
> "\x40" it would have been consistent. But s[2] was returning nil even
> though s.length was 3.
>
> And after Tanaka Akira's fix, Ruby does exactly what I was expecting.

I don't oppose Akira's fix, but expecting consistent output from 
inconsistent input is essentially futile. I sincerely hope nobody will 
add this case to a test suite or will claim that this is THE right way 
to do things.

Regards,    Martin.

