Ruby Issue Tracking System
Ruby master - Bug #2636: Incorrect UTF-16 string lengthhttps://bugs.ruby-lang.org/issues/2636?journal_id=77992010-01-24T13:28:28Znaruse (Yui NARUSE)naruse@airemix.jp
<ul><li><strong>Status</strong> changed from <i>Open</i> to <i>Rejected</i></li></ul><p>=begin<br>
"\xD8\x40\xDC\x0B".force_encoding(Encoding::UTF_16BE) is correct.<br>
=end</p> Ruby master - Bug #2636: Incorrect UTF-16 string lengthhttps://bugs.ruby-lang.org/issues/2636?journal_id=78002010-01-24T13:30:04Znaruse (Yui NARUSE)naruse@airemix.jp
<ul></ul><p>=begin<br>
Or the following will explain this:</p>
<blockquote>
<p>"\xDC\x0b\xD8\x40".force_encoding(Encoding::UTF_16BE)<br>
=> "\xDC\u0BD8\x40"</p>
</blockquote>
<p>=end</p> Ruby master - Bug #2636: Incorrect UTF-16 string lengthhttps://bugs.ruby-lang.org/issues/2636?journal_id=78292010-01-25T16:09:03Zduerst (Martin Dürst)duerst@it.aoyama.ac.jp
<ul></ul><p>=begin<br>
What needs to be fixed here is the data, nothing else:</p>
<p>irb(main):001:> s = "\xDC\x0B\xD8\x40".force_encoding 'UTF-16BE'<br>
=> "\xDC\x{BD8}\x40"<br>
irb(main):002:> s.valid_encoding?<br>
=> false</p>
<p>Returning 2 for s.length may be called "somewhat more correct" than<br>
returning 3, but in both cases, it's basically garbage in, garbage out.<br>
Single (unpaired) surrogates are not characters in UTF-16. The most<br>
correct answer might be "nil", in the sense of "sorry, wrong question".</p>
<p>The only reason #length just returns something, rather than throwing an<br>
error, for the above case, is efficiency.</p>
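<p>As an illustration of the "sorry, wrong question" idea, here is a minimal sketch that refuses to report a length for data that is invalid in its encoding. The helper name checked_length is hypothetical and not part of Ruby; only String#valid_encoding?, String#length, String#force_encoding and Array#pack are real methods. The strings are built from their 16-bit words so that the swapped surrogates are easy to see.</p>

```ruby
# Hypothetical helper (not part of Ruby): report a character count only
# when the bytes are valid in the string's encoding, otherwise return nil.
def checked_length(str)
  str.valid_encoding? ? str.length : nil
end

# Two 16-bit words in big-endian order, tagged as UTF-16BE.
# 0xD840 is a high surrogate and 0xDC0B is a low surrogate.
paired  = [0xD840, 0xDC0B].pack("n*").force_encoding(Encoding::UTF_16BE)
swapped = [0xDC0B, 0xD840].pack("n*").force_encoding(Encoding::UTF_16BE)

p checked_length(paired)   # valid surrogate pair, one character
p checked_length(swapped)  # unpaired surrogates, so no meaningful answer
```

<p>This does an extra validity scan on every call, which is presumably the cost that #length avoids by just returning a number.</p>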
<p>Regards, Martin.</p>
<p>On 2010/01/24 14:36, Tanaka Akira wrote:</p>
<blockquote>
<p>2010/1/24 Vincent Isambart <a href="mailto:redmine@ruby-lang.org" class="email">redmine@ruby-lang.org</a>:</p>
<blockquote>
<p>Bug <a class="issue tracker-1 status-5 priority-4 priority-default closed" title="Bug: Incorrect UTF-16 string length (Closed)" href="https://bugs.ruby-lang.org/issues/2636">#2636</a>: Incorrect UTF-16 string length<br>
<a href="http://redmine.ruby-lang.org/issues/show/2636" class="external">http://redmine.ruby-lang.org/issues/show/2636</a></p>
</blockquote>
<blockquote>
<p>str = "\xDC\x0B\xD8\x40".force_encoding(Encoding::UTF_16BE)<br>
str.length #=> 3</p>
<p>This string is made by swapping the two 16-bit words of a UTF-16 character outside the BMP.<br>
The length should be 2, not 3, because the string consists of two (unpaired) surrogates.</p>
</blockquote>
<p>Fixed.</p>
<p>% ./ruby -ve '<br>
s = "\xDC\x0B\xD8\x40".force_encoding(Encoding::UTF_16BE)<br>
p s<br>
p s.length'<br>
ruby 1.9.2dev (2010-01-24 trunk 26392) [i686-linux]<br>
"\xDC\x0B\xD8\x40"<br>
2</p>
</blockquote>
<p>--<br>
#-# Martin J. Dürst, Professor, Aoyama Gakuin University<br>
#-# <a href="http://www.sw.it.aoyama.ac.jp" class="external">http://www.sw.it.aoyama.ac.jp</a> <a href="mailto:duerst@it.aoyama.ac.jp" class="email">mailto:duerst@it.aoyama.ac.jp</a></p>
<p>=end</p> Ruby master - Bug #2636: Incorrect UTF-16 string lengthhttps://bugs.ruby-lang.org/issues/2636?journal_id=78302010-01-25T16:37:53Zscritch (Vincent Isambart)vincent.isambart@gmail.com
<ul></ul><p>=begin</p>
<blockquote>
<p>What needs to be fixed here is the data, nothing else:</p>
<p>irb(main):001:> s = "\xDC\x0B\xD8\x40".force_encoding 'UTF-16BE'<br>
=> "\xDC\x{BD8}\x40"<br>
irb(main):002:> s.valid_encoding?<br>
=> false</p>
</blockquote>
<p>Yes I know the data is invalid UTF-16. I created it on purpose (to<br>
test code I'm working on for MacRuby).</p>
<p>My main concern was that what #length and #[] were doing was different.<br>
If s[0], s[1], s[2] would have been returning "\xDC", "\x{BD8}" and<br>
"\x40" it would have been consistent. But s[2] was returning nil even<br>
though s.length was 3.</p>
<p>And after Tanaka Akira's fix, Ruby does exactly what I was expecting.</p>
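<p>The consistency expectation described above can be written down as a small check. This is only a sketch: the method name length_matches_indexing? is hypothetical, and it assumes that "consistent" means every index below #length yields a substring while the index equal to #length yields nil.</p>

```ruby
# Hypothetical check (not part of Ruby): #length and #[] agree when every
# index in 0...length returns a character and the index just past the last
# character returns nil.
def length_matches_indexing?(str)
  in_range = (0...str.length).all? { |i| !str[i].nil? }
  in_range && str[str.length].nil?
end

# The swapped-surrogate data from this report, built from 16-bit words.
swapped = [0xDC0B, 0xD840].pack("n*").force_encoding(Encoding::UTF_16BE)

p swapped.length
p length_matches_indexing?(swapped)
```

<p>On a Ruby that includes the fix, both calls are expected to agree on two units for this string; the data itself is still invalid UTF-16.</p>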
<p>=end</p> Ruby master - Bug #2636: Incorrect UTF-16 string lengthhttps://bugs.ruby-lang.org/issues/2636?journal_id=78322010-01-25T16:42:44Znaruse (Yui NARUSE)naruse@airemix.jp
<ul><li><strong>Status</strong> changed from <i>Rejected</i> to <i>Closed</i></li></ul><p>=begin</p>
<blockquote>
<p>My main concern was that what #length and #[] were doing was different.<br>
If s[0], s[1], s[2] would have been returning "\xDC", "\x{BD8}" and<br>
"\x40" it would have been consistent. But s[2] was returning nil even<br>
though s.length was 3.</p>
</blockquote>
<p>Ah, I see. Current behavior seems correct.<br>
=end</p> Ruby master - Bug #2636: Incorrect UTF-16 string lengthhttps://bugs.ruby-lang.org/issues/2636?journal_id=79402010-01-27T18:27:31Zduerst (Martin Dürst)duerst@it.aoyama.ac.jp
<ul></ul><p>=begin<br>
On 2010/01/25 16:37, Vincent Isambart wrote:</p>
<blockquote>
<blockquote>
<p>What needs to be fixed here is the data, nothing else:</p>
<p>irb(main):001:> s = "\xDC\x0B\xD8\x40".force_encoding 'UTF-16BE'<br>
=> "\xDC\x{BD8}\x40"<br>
irb(main):002:> s.valid_encoding?<br>
=> false</p>
</blockquote>
<p>Yes I know the data is invalid UTF-16. I created it on purpose (to<br>
test code I'm working on for MacRuby).</p>
<p>My main concern was that what #length and #[] were doing was different.<br>
If s[0], s[1], s[2] would have been returning "\xDC", "\x{BD8}" and<br>
"\x40" it would have been consistent. But s[2] was returning nil even<br>
though s.length was 3.</p>
<p>And after Tanaka Akira's fix, Ruby does exactly what I was expecting.</p>
</blockquote>
<p>I don't oppose Akira's fix, but expecting consistent output from<br>
inconsistent input is essentially futile. I sincerely hope nobody will<br>
add this case to a test suite or will claim that this is THE right way<br>
to do things.</p>
<p>Regards, Martin.</p>
<p>--<br>
#-# Martin J. Dürst, Professor, Aoyama Gakuin University<br>
#-# <a href="http://www.sw.it.aoyama.ac.jp" class="external">http://www.sw.it.aoyama.ac.jp</a> <a href="mailto:duerst@it.aoyama.ac.jp" class="email">mailto:duerst@it.aoyama.ac.jp</a></p>
<p>=end</p>