Bug #17594

Sort order of UTF-16LE is based on binary representation instead of codepoints

Added by Dan0042 (Daniel DeLorme) about 3 years ago. Updated about 3 years ago.

Status: Rejected
Assignee: -
Target version: -
[ruby-core:102314]

Description

I just discovered that string sorting is always based on bytes, so sorting UTF-16LE strings gives some peculiar results:

BE, LE = 'UTF-16BE', 'UTF-16LE'
str = [*0..0x4ff].pack('U*').scan(/\p{Ll}/).join

puts str.encode(BE).chars.sort.first(50).join.encode('UTF-8')
#abcdefghijklmnopqrstuvwxyzµßàáâãäåæçèéêëìíîïðñòóôõ

puts str.encode(LE).chars.sort.first(50).join.encode('UTF-8')
#āȁăȃąȅćȇĉȉċȋčȍďȏđȑēȓĕȕėȗęșěțĝȝğȟġȡģȣĥȥħȧĩȩīȫĭȭįȯаı

'a'.encode(BE) < 'ā'.encode(BE) #=> true
'a'.encode(LE) < 'ā'.encode(LE) #=> false
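
For reference, the flipped result falls out of the little-endian byte layout: the comparison looks at the low-order byte of the first code unit first, so 'a' (U+0061) ends up after 'ā' (U+0101):

'a'.encode(BE).bytes #=> [0, 97]
'ā'.encode(BE).bytes #=> [1, 1]
'a'.encode(LE).bytes #=> [97, 0]  (0x61 > 0x01, so 'a' compares greater)
'ā'.encode(LE).bytes #=> [1, 1]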

Is this supposed to be correct? I mean, I somewhat understand the idea of just sorting by bytes, but I find the above output to be remarkably nonsensical.

A similar/related issue was found and fixed in #8653, so there's precedent for considering codepoints instead of bytes.

The reason I'm asking is that I was working on some optimizations for String#casecmp (https://github.com/ruby/ruby/pull/4133) which, as a side effect, compare UTF-16LE strings by codepoint rather than by byte. That resulted in a different order for <=> vs casecmp, and thus some tests broke. But I think sorting by codepoint would be better in this case.
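
Just to make the intended ordering concrete, here is a minimal sketch of what "compare by codepoint" means; codepoint_cmp is a hypothetical helper for illustration, not the code in the PR:

# Hypothetical helper, not the PR's implementation: compare two strings
# of the same encoding by their codepoint sequences instead of their bytes.
def codepoint_cmp(a, b)
  a.codepoints <=> b.codepoints
end

codepoint_cmp('a'.encode(LE), 'ā'.encode(LE)) #=> -1  (matches the BE / UTF-8 order)
'a'.encode(LE) <=> 'ā'.encode(LE)             #=> 1   (current byte-based order)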
