Ruby master - Bug #2636: Incorrect UTF-16 string length</h1> <article> <p>2010-01-24T13:28:28Z</p> <ul><li><strong>Status</strong> changed from <i>Open</i> to <i>Rejected</i></li></ul><p>"\xD8\x40\xDC\x0B".force_encoding(Encoding::UTF_16BE) is correct.</p> </article> <article> <p>2010-01-24T13:30:04Z</p> <ul></ul><p>Or the following will explain this:</p> <blockquote> <p>"\xDC\x0b\xD8\x40".force_encoding(Encoding::UTF_16BE)<br> => "\xDC\u0BD8\x40"</p> </blockquote> </article> <article> <p>2010-01-25T16:09:03Z</p> <ul></ul><p>What needs to be fixed here is the data, nothing else:</p> <p>irb(main):001:> s = "\xDC\x0B\xD8\x40".force_encoding 'UTF-16BE'<br> => "\xDC\x{BD8}\x40"<br> irb(main):002:> s.valid_encoding?<br> => false</p> <p>Returning 2 for s.length may be called "somewhat more correct" than<br> returning 3, but in both cases, it's basically garbage in, garbage out.<br> Single (unpaired) surrogates are not characters in UTF-16. 
The most<br> correct answer might be "nil", in the sense of "sorry, wrong question".</p> <p>The only reason #length returns something for the above case, rather than raising an<br> error, is efficiency.</p> <p>Regards, Martin.</p> <p>On 2010/01/24 14:36, Tanaka Akira wrote:</p> <blockquote> <p>2010/1/24 Vincent Isambart <a href="mailto:redmine@ruby-lang.org" class="email">redmine@ruby-lang.org</a>:</p> <blockquote> <p>Bug <a class="issue tracker-1 status-5 priority-4 priority-default closed" title="Bug: Incorrect UTF-16 string length (Closed)" href="https://bugs.ruby-lang.org/issues/2636">#2636</a>: Incorrect UTF-16 string length<br> <a href="http://redmine.ruby-lang.org/issues/show/2636" class="external">http://redmine.ruby-lang.org/issues/show/2636</a></p> </blockquote> <blockquote> <p>str = "\xDC\x0B\xD8\x40".force_encoding(Encoding::UTF_16BE)<br> str.length #=> 3</p> <p>This string is made by swapping the 2 words of a UTF-16 character not in the BMP.<br> The length should be 2, not 3, because it's made of two (unpaired) surrogates.</p> </blockquote> <p>Fixed.</p> <p>% ./ruby -ve '<br> s = "\xDC\x0B\xD8\x40".force_encoding(Encoding::UTF_16BE)<br> p s<br> p s.length'<br> ruby 1.9.2dev (2010-01-24 trunk 26392) [i686-linux]<br> "\xDC\x0B\xD8\x40"<br> 2</p> </blockquote> <p>--<br> #-# Martin J. Dürst, Professor, Aoyama Gakuin University<br> #-# <a href="http://www.sw.it.aoyama.ac.jp" class="external">http://www.sw.it.aoyama.ac.jp</a> <a href="mailto:duerst@it.aoyama.ac.jp" class="email">mailto:duerst@it.aoyama.ac.jp</a></p> </article> <article> <p>2010-01-25T16:37:53Z</p> <ul></ul> <blockquote> <p>What needs to be fixed here is the data, nothing else:</p> <p>irb(main):001:> s = "\xDC\x0B\xD8\x40".force_encoding 'UTF-16BE'<br> => "\xDC\x{BD8}\x40"<br> irb(main):002:> s.valid_encoding?<br> => false</p> </blockquote> <p>Yes, I know the data is invalid UTF-16. 
I created it on purpose (to<br> test code I'm working on for MacRuby).</p> <p>My main concern was that #length and #[] disagreed with each other.<br> If s[0], s[1], and s[2] had returned "\xDC", "\x{BD8}", and<br> "\x40", it would have been consistent. But s[2] returned nil even<br> though s.length was 3.</p> <p>And after Tanaka Akira's fix, Ruby does exactly what I was expecting.</p> </article> <article> <p>2010-01-25T16:42:44Z</p> <ul><li><strong>Status</strong> changed from <i>Rejected</i> to <i>Closed</i></li></ul> <blockquote> <p>My main concern was that #length and #[] disagreed with each other.<br> If s[0], s[1], and s[2] had returned "\xDC", "\x{BD8}", and<br> "\x40", it would have been consistent. But s[2] returned nil even<br> though s.length was 3.</p> </blockquote> <p>Ah, I see. The current behavior seems correct.</p> </article> <article> <p>2010-01-27T18:27:31Z</p> <ul></ul><p>On 2010/01/25 16:37, Vincent Isambart wrote:</p> <blockquote> <blockquote> <p>What needs to be fixed here is the data, nothing else:</p> <p>irb(main):001:> s = "\xDC\x0B\xD8\x40".force_encoding 'UTF-16BE'<br> => "\xDC\x{BD8}\x40"<br> irb(main):002:> s.valid_encoding?<br> => false</p> </blockquote> <p>Yes, I know the data is invalid UTF-16. I created it on purpose (to<br> test code I'm working on for MacRuby).</p> <p>My main concern was that #length and #[] disagreed with each other.<br> If s[0], s[1], and s[2] had returned "\xDC", "\x{BD8}", and<br> "\x40", it would have been consistent. But s[2] returned nil even<br> though s.length was 3.</p> <p>And after Tanaka Akira's fix, Ruby does exactly what I was expecting.</p> </blockquote> <p>I don't oppose Akira's fix, but expecting consistent output from<br> inconsistent input is essentially futile. 
I sincerely hope nobody will<br> add this case to a test suite or will claim that this is THE right way<br> to do things.</p> <p>Regards, Martin.</p> <p>--<br> #-# Martin J. Dürst, Professor, Aoyama Gakuin University<br> #-# <a href="http://www.sw.it.aoyama.ac.jp" class="external">http://www.sw.it.aoyama.ac.jp</a> <a href="mailto:duerst@it.aoyama.ac.jp" class="email">mailto:duerst@it.aoyama.ac.jp</a></p> </article> </main></body></html>