Bug #17594


Sort order of UTF-16LE is based on binary representation instead of codepoints

Added by Dan0042 (Daniel DeLorme) about 3 years ago. Updated about 3 years ago.

Status: Rejected
Assignee: -
Target version: -
[ruby-core:102314]

Description

I just discovered that string sorting is always based on bytes, so sorting UTF-16LE strings gives some peculiar results:

BE, LE = 'UTF-16BE', 'UTF-16LE'
str = [*0..0x4ff].pack('U*').scan(/\p{Ll}/).join

puts str.encode(BE).chars.sort.first(50).join.encode('UTF-8')
#abcdefghijklmnopqrstuvwxyzµßàáâãäåæçèéêëìíîïðñòóôõ

puts str.encode(LE).chars.sort.first(50).join.encode('UTF-8')
#āȁăȃąȅćȇĉȉċȋčȍďȏđȑēȓĕȕėȗęșěțĝȝğȟġȡģȣĥȥħȧĩȩīȫĭȭįȯаı

'a'.encode(BE) < 'ā'.encode(BE) #=> true
'a'.encode(LE) < 'ā'.encode(LE) #=> false
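
The inversion is easy to see at the byte level (an illustrative dump, reusing the LE constant above):

'a'.encode(LE).bytes #=> [97, 0]   U+0061
'ā'.encode(LE).bytes #=> [1, 1]    U+0101

The first bytes compared are 97 vs 1, so byte order puts "ā" before "a".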

Is this supposed to be correct? I mean, I somewhat understand the idea of just sorting by bytes, but I find the above output to be remarkably nonsensical.

A similar/related issue was found and fixed in #8653, so there's precedent for considering codepoints instead of bytes.

The reason I'm asking is that I was working on some optimizations for String#casecmp (https://github.com/ruby/ruby/pull/4133) which, as a side effect, sort by codepoint for UTF-16LE. That made <=> and casecmp disagree on ordering, so some tests broke. But I think sorting by codepoint would be better in this case.

Updated by duerst (Martin Dürst) about 3 years ago

Real high-quality string sorting would need language information, because sorting differs by language (e.g. German sorts ä with a, but Swedish sorts it after z). Binary sorting may work for ASCII, but even there it doesn't consider case equivalence. Implementing a high-quality sort is a major project.

UTF-8 is Ruby's main encoding. UTF-16BE and UTF-16LE are 'tolerated', so to speak. Because of the nature of UTF-16LE, the result of sorting by bytes is indeed quite nonsensical; your example only scratches the surface. But changing it to sort by code unit (i.e. two bytes at a time, as sketched below) would just introduce a special case for not much actual gain.
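
For illustration, a by-code-unit comparison for UTF-16LE could look like the following sketch (code_unit_cmp is a hypothetical helper, not an existing API):

# Compare 16-bit code units; 'v*' unpacks unsigned 16-bit little-endian.
def code_unit_cmp(s1, s2)
  s1.unpack('v*') <=> s2.unpack('v*')
end

Note that this matches codepoint order only within the BMP; above U+FFFF, surrogate pairs make UTF-16 code-unit order diverge from codepoint order.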

Updated by Dan0042 (Daniel DeLorme) about 3 years ago

I agree that real high-quality string sorting requires a specialized library, but there's no need to aim so high here. Perfect is the enemy of good. I think it would be good to have a minimum level of consistency between Unicode encodings.

Or maybe per-codepoint ordering could apply only to casecmp? It already has a different order from case-sensitive sorting anyway...

Updated by Dan0042 (Daniel DeLorme) about 3 years ago

Technically speaking I'm not sure this is a bug per se. The code correctly implements the spec of sorting by byte, but we might consider this a bug in the spec, or a feature request?

summary for dev meeting

For most encodings the byte ordering is the same as the codepoint ordering; the exceptions are below, each with an example pair where byte order places the larger codepoint before the smaller one (a quick check follows the list):
UTF-16LE: 256 < 255
UTF-32LE: 256 < 255
Windows-31J: 33088 < 223
Shift_JIS: 33088 < 223
MacJapanese: 33088 < 223
SJIS-DoCoMo: 33088 < 223
SJIS-KDDI: 33088 < 223
SJIS-SoftBank: 33088 < 223
UTF-16BE: 65536 < 65535
UTF-16: 65536 < 65535
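
A quick check of the UTF-16LE line above (an illustrative snippet, not part of the summary):

a, b = 255.chr('UTF-8'), 256.chr('UTF-8')   # "ÿ" and "Ā"
le_a, le_b = a.encode('UTF-16LE'), b.encode('UTF-16LE')
le_a <=> le_b         #=> 1   byte order: Ā (256) first
le_a.ord <=> le_b.ord #=> -1  codepoint order: ÿ (255) first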

For the UTF family of encodings it would be more consistent if "a" < "ā" always, regardless of encoding. The UTF encodings are self-synchronizing, so there's no performance cost to this change. It's simple to search for the first byte difference and then find the codepoint at that location.
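
A minimal Ruby sketch of the proposed comparison (codepoint_cmp is a hypothetical helper; the real change would live in String#<=>'s C implementation, which could stop at the first differing byte):

# Compare codepoint sequences instead of raw bytes.
def codepoint_cmp(s1, s2)
  s1.each_char.map(&:ord) <=> s2.each_char.map(&:ord)
end

codepoint_cmp('a'.encode('UTF-16LE'), 'ā'.encode('UTF-16LE')) #=> -1, matching UTF-8 order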

The SJIS family of encodings should not be changed because

  1. they are not self-synchronizing, so sorting by codepoint requires parsing the entire string;
  2. the current byte ordering results in あ < ア < ｱ, which is the same as Unicode order. Codepoint sorting would give ｱ < あ < ア; this incompatibility is not desirable (see the snippet below).
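
To illustrate point 2 (an assumed demonstration; output re-encoded to UTF-8 for readability):

chars = %w[ｱ あ ア].map { |c| c.encode('Shift_JIS') }
chars.sort.map { |c| c.encode('UTF-8') }           #=> ["あ", "ア", "ｱ"]  byte order
chars.sort_by(&:ord).map { |c| c.encode('UTF-8') } #=> ["ｱ", "あ", "ア"]  codepoint order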

For these encodings, str1.casecmp(str2) scans each codepoint of the two strings and:

a) if both codepoints are ASCII, lowercases them and compares;
b) otherwise, throws away the codepoints and compares by byte.

So it would be simpler and more efficient to just compare by codepoint (and, for SJIS, to fall back to byte comparison only once two differing codepoints are found).
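
A rough Ruby model of the current behavior (model_casecmp is a hypothetical name; this approximates, rather than transcribes, the C code):

# ASCII letters are folded to lowercase and compared by value;
# any other pair of characters falls back to raw byte comparison.
def model_casecmp(s1, s2)
  fold = ->(c) { o = c.ord; (0x41..0x5A).cover?(o) ? o + 0x20 : o }
  c1s, c2s = s1.each_char.to_a, s2.each_char.to_a
  c1s.zip(c2s) do |c1, c2|
    return 1 if c2.nil?                    # s2 ran out first
    r = if c1.ord < 0x80 && c2.ord < 0x80  # both ASCII
          fold[c1] <=> fold[c2]
        else
          c1.b <=> c2.b                    # bytes, codepoints discarded
        end
    return r unless r.zero?
  end
  c1s.size <=> c2s.size
end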

Updated by naruse (Yui NARUSE) about 3 years ago

  • Status changed from Open to Rejected

In Ruby, UTF-16 and UTF-32 are not first-class citizens. This behavior is by design.
As you say, we could implement a better comparison, but we don't think there are real needs for UTF-16.

If you have real-world needs, we can reconsider, but as far as I understand, you have only a theoretical proposal.
