Bug #2636
closed
Incorrect UTF-16 string length
Description
=begin
str = "\xDC\x0B\xD8\x40".force_encoding(Encoding::UTF_16BE)
str.length #=> 3
This string is made by inverting 2 words of a UTF-16 character not in the BMP.
The length should be 2 because it's made of two (unpaired) surrogates and not 3.
The strangest part is that the length agrees with how the string is displayed by #inspect ("\xDC\u0BD8\x40"), but not with what #[] does. If the length is 3, then why does str[2] return nil?
=end
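As an illustration of how such a test string can be produced, here is a minimal sketch. It assumes the original character is U+2000B (the code point that the surrogate pair D840 DC0B decodes to); `hex_bytes` is a hypothetical helper introduced only for display.

```ruby
# Hypothetical helper: render a string's bytes as space-separated hex.
def hex_bytes(str)
  str.bytes.map { |b| format("%02X", b) }.join(" ")
end

# U+2000B lies outside the BMP, so UTF-16BE encodes it as a surrogate pair.
valid = [0x2000B].pack("U").encode(Encoding::UTF_16BE)

# Split into 16-bit big-endian words, swap them, and reassemble the bytes.
words   = valid.unpack("n*")   # [0xD840, 0xDC0B]
swapped = words.reverse.pack("n*").force_encoding(Encoding::UTF_16BE)

puts hex_bytes(valid)          # D8 40 DC 0B
puts hex_bytes(swapped)        # DC 0B D8 40
puts valid.valid_encoding?     # true
puts swapped.valid_encoding?   # false
```

The swapped string starts with a low surrogate and ends with a high surrogate, so neither 16-bit word forms a character on its own.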
Updated by naruse (Yui NARUSE) almost 15 years ago
- Status changed from Open to Rejected
=begin
"\xD8\x40\xDC\x0B".force_encoding(Encoding::UTF_16BE) is correct.
=end
Updated by naruse (Yui NARUSE) almost 15 years ago
=begin
Or the following will explain this:
"\xDC\x0b\xD8\x40".force_encoding(Encoding::UTF_16BE)
=> "\xDC\u0BD8\x40"
=end
Updated by duerst (Martin Dürst) almost 15 years ago
=begin
What needs to be fixed here is the data, nothing else:
irb(main):001:> s = "\xDC\x0B\xD8\x40".force_encoding 'UTF-16BE'
=> "\xDC\x{BD8}\x40"
irb(main):002:> s.valid_encoding?
=> false
Returning 2 for s.length may be called "somewhat more correct" than
returning 3, but in both cases, it's basically garbage in, garbage out.
Single (unpaired) surrogates are not characters in UTF-16. The most
correct answer might be "nil", in the sense of "sorry, wrong question".
The only reason #length just returns something, rather than throwing an
error, for the above case, is efficiency.
Regards, Martin.
On 2010/01/24 14:36, Tanaka Akira wrote:
2010/1/24 Vincent Isambart <redmine@ruby-lang.org>:
Bug #2636: Incorrect UTF-16 string length
http://redmine.ruby-lang.org/issues/show/2636
str = "\xDC\x0B\xD8\x40".force_encoding(Encoding::UTF_16BE)
str.length #=> 3
This string is made by inverting 2 words of a UTF-16 character not in the BMP.
The length should be 2 because it's made of two (unpaired) surrogates and not 3.
Fixed.
% ./ruby -ve '
s = "\xDC\x0B\xD8\x40".force_encoding(Encoding::UTF_16BE)
p s
p s.length'
ruby 1.9.2dev (2010-01-24 trunk 26392) [i686-linux]
"\xDC\x0B\xD8\x40"
2
--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
=end
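The "sorry, wrong question" idea above can be sketched in user code by checking validity before asking for a length. `checked_length` is a hypothetical helper, not part of Ruby's String API; it only combines the existing `valid_encoding?` and `length` methods.

```ruby
# Hypothetical helper: return the character count only when the string is
# validly encoded, and nil otherwise.
def checked_length(str)
  return nil unless str.valid_encoding?
  str.length
end

valid   = "\xD8\x40\xDC\x0B".dup.force_encoding(Encoding::UTF_16BE)
invalid = "\xDC\x0B\xD8\x40".dup.force_encoding(Encoding::UTF_16BE)

p checked_length(valid)    # => 1
p checked_length(invalid)  # => nil
```

This moves the validity check to the caller, so #length itself can stay cheap for the common case of well-formed data.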
Updated by scritch (Vincent Isambart) almost 15 years ago
=begin
What needs to be fixed here is the data, nothing else:
irb(main):001:> s = "\xDC\x0B\xD8\x40".force_encoding 'UTF-16BE'
=> "\xDC\x{BD8}\x40"
irb(main):002:> s.valid_encoding?
=> false
Yes I know the data is invalid UTF-16. I created it on purpose (to
test code I'm working on for MacRuby).
My main concern was that what #length and #[] were doing was different.
If s[0], s[1], s[2] would have been returning "\xDC", "\x{BD8}" and
"\x40" it would have been consistent. But s[2] was returning nil even
though s.length was 3.
And after Tanaka Akira's fix, Ruby does exactly what I was expecting.
=end
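The consistency being asked for here can be expressed as a small check: every index below #length yields a character, and the index equal to #length yields nil. This is only a sketch; `length_matches_indexing?` is a hypothetical helper, and the result for the swapped-words string depends on the Ruby version in use (per this thread, it was inconsistent before the trunk fix).

```ruby
# Hypothetical helper: true when str[i] is non-nil for every i below
# str.length and str[str.length] is nil.
def length_matches_indexing?(str)
  (0...str.length).all? { |i| !str[i].nil? } && str[str.length].nil?
end

samples = {
  "ASCII text"     => "abc".encode(Encoding::UTF_16BE),
  "surrogate pair" => "\xD8\x40\xDC\x0B".dup.force_encoding(Encoding::UTF_16BE),
  "swapped words"  => "\xDC\x0B\xD8\x40".dup.force_encoding(Encoding::UTF_16BE),
}

samples.each do |label, s|
  puts "#{label}: length=#{s.length} consistent=#{length_matches_indexing?(s)}"
end
```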
Updated by naruse (Yui NARUSE) almost 15 years ago
- Status changed from Rejected to Closed
=begin
My main concern was that what #length and #[] were doing was different.
If s[0], s[1], s[2] would have been returning "\xDC", "\x{BD8}" and
"\x40" it would have been consistent. But s[2] was returning nil even
though s.length was 3.
Ah, I see. Current behavior seems correct.
=end
Updated by duerst (Martin Dürst) almost 15 years ago
=begin
On 2010/01/25 16:37, Vincent Isambart wrote:
What needs to be fixed here is the data, nothing else:
irb(main):001:> s = "\xDC\x0B\xD8\x40".force_encoding 'UTF-16BE'
=> "\xDC\x{BD8}\x40"
irb(main):002:> s.valid_encoding?
=> false
Yes I know the data is invalid UTF-16. I created it on purpose (to
test code I'm working on for MacRuby).
My main concern was that what #length and #[] were doing was different.
If s[0], s[1], s[2] would have been returning "\xDC", "\x{BD8}" and
"\x40" it would have been consistent. But s[2] was returning nil even
though s.length was 3.
And after Tanaka Akira's fix, Ruby does exactly what I was expecting.
I don't oppose Akira's fix, but expecting consistent output from
inconsistent input is essentially futile. I sincerely hope nobody will
add this case to a test suite or will claim that this is THE right way
to do things.
Regards, Martin.
=end