Bug #21559

closed

Unicode normalization nfd -> nfc -> nfd is not reversible

Added by tompng (tomoya ishida) 2 months ago. Updated 7 days ago.

Status: Closed
Target version: -
[ruby-core:123146]

Description

I expect nfd(nfc(str)) == nfd(str) to hold for every string, but I found strings for which it does not.

# Ruby 3.1 - 3.5
str = "s\u{11930}\u{323}\u{11930}\u{307}"
p str.unicode_normalize(:nfd) == str.unicode_normalize(:nfc).unicode_normalize(:nfd)
#=> false
# ruby 3.5.0dev
str = "s\u{1611e}\u{323}\u{1611e}\u{307}\u{1611f}"
p str.unicode_normalize(:nfd) == str.unicode_normalize(:nfc).unicode_normalize(:nfd)
#=> false
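
To make the check above easier to repeat on other strings, here is a minimal sketch. The helper names `nfd_roundtrip?` and `codepoints_hex` are hypothetical and not part of Ruby; only `String#unicode_normalize` and `String#each_codepoint` are real APIs. The result for the reported string depends on the Ruby version in use, so no expected output is shown.

```ruby
# Hypothetical helper: true when NFD(NFC(str)) equals NFD(str).
def nfd_roundtrip?(str)
  str.unicode_normalize(:nfd) == str.unicode_normalize(:nfc).unicode_normalize(:nfd)
end

# Hypothetical helper: render a string as a list of code points,
# which makes it easier to see where two normal forms diverge.
def codepoints_hex(str)
  str.each_codepoint.map { |cp| format("U+%04X", cp) }.join(" ")
end

str = "s\u{11930}\u{323}\u{11930}\u{307}"
puts nfd_roundtrip?(str)
puts codepoints_hex(str.unicode_normalize(:nfd))
puts codepoints_hex(str.unicode_normalize(:nfc).unicode_normalize(:nfd))
```

Comparing the two code point listings shows which marks moved when the round trip fails.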

Updated by nobu (Nobuyoshi Nakada) 2 months ago #1 [ruby-core:123147]

"s\u{11930 323 11930 307}".unicode_normalize(:nfc).dump #=> "\u1E69\u{11930}\u{11930}"
"s\u{323 307}".unicode_normalize(:nfc).dump  #=> "\u1E69"

Are U+0323 and U+0307 being composed with s, jumping over U+11930?

Updated by ima1zumi (Mari Imaizumi) 2 months ago #2 [ruby-core:123148]

  • Assignee set to ima1zumi (Mari Imaizumi)

This looks like a bug. Per Unicode TR15, the identity toNFD(x) == toNFD(toNFC(x)) must be maintained. https://unicode.org/reports/tr15/#Design_Goals
It seems the NFC process is combining characters across U+11930, even though its CCC is 0.
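
The blocking rule behind this can be sketched as follows: a combining mark may only compose with the last preceding starter (a character with CCC 0), and only if no intervening character has a CCC greater than or equal to its own. This is an illustration, not Ruby's actual implementation. `CCC_TABLE`, `COMPOSE_TABLE`, `ccc` and `compose_sketch` are hypothetical names, the tables cover only the characters in this report, and the input is assumed to be already in NFD (canonically ordered).

```ruby
# Hypothetical miniature tables; a real implementation uses the full Unicode data.
CCC_TABLE = {
  0x0323 => 220, # COMBINING DOT BELOW
  0x0307 => 230  # COMBINING DOT ABOVE
}.freeze

COMPOSE_TABLE = {
  [0x0073, 0x0323] => 0x1E63, # s + dot below
  [0x0073, 0x0307] => 0x1E61, # s + dot above
  [0x1E63, 0x0307] => 0x1E69  # s with dot below + dot above
}.freeze

# Characters missing from the table are treated as starters (CCC 0),
# which is correct for s and U+11930.
def ccc(cp)
  CCC_TABLE.fetch(cp, 0)
end

def compose_sketch(nfd_str)
  out = []
  starter_idx = nil # position in out of the last starter seen
  nfd_str.each_codepoint do |cp|
    cc = ccc(cp)
    if starter_idx
      # Blocked when something sits between the starter and cp
      # and its CCC is >= the CCC of cp.
      blocked = out.length - 1 != starter_idx && ccc(out.last) >= cc
      composite = COMPOSE_TABLE[[out[starter_idx], cp]]
      if composite && !blocked
        out[starter_idx] = composite
        next
      end
    end
    starter_idx = out.length if cc.zero?
    out << cp
  end
  out.pack("U*")
end

str = "s\u{11930}\u{323}\u{11930}\u{307}"
p compose_sketch(str) == str                     #=> true
p compose_sketch("s\u{323}\u{307}") == "\u1E69"  #=> true
```

In this sketch U+11930 becomes the last starter as soon as it is seen, so U+0323 and U+0307 can no longer reach the s, and the string is left unchanged.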

CC: @duerst (Martin Dürst)

Updated by duerst (Martin Dürst) 2 months ago #3 [ruby-core:123154]

  • Assignee changed from ima1zumi (Mari Imaizumi) to duerst (Martin Dürst)

@ima1zumi (Mari Imaizumi) I'm not sure this is even allowed, but I'm sure I'm responsible for this behavior and I want to fix it myself, so I'm changing the Assignee to myself.

Updated by ima1zumi (Mari Imaizumi) 2 months ago #4 [ruby-core:123160]

@duerst (Martin Dürst) Thank you, I appreciate you taking care of it.

Updated by duerst (Martin Dürst) 8 days ago #5 [ruby-core:123639]

  • Status changed from Open to Closed

Updated by duerst (Martin Dürst) 8 days ago #6 [ruby-core:123640]

  • Backport changed from 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN to 3.2: DONTNEED, 3.3: DONTNEED, 3.4: DONTNEED

Backport would only be needed if the upgrade to Unicode 16.0.0 (see https://bugs.ruby-lang.org/issues/20724) is backported.
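
Since the backport decision hinges on the Unicode version, a quick way to see which Unicode data a given Ruby build uses is sketched below. It assumes the `UNICODE_VERSION` key is present in `RbConfig::CONFIG`, which is the case on recent Ruby releases; the variable name `unicode_version` is only illustrative.

```ruby
require "rbconfig"

# Unicode version this Ruby was built against, as a dotted version string.
unicode_version = RbConfig::CONFIG["UNICODE_VERSION"]
puts "Ruby #{RUBY_VERSION} uses Unicode #{unicode_version}"
```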

Updated by duerst (Martin Dürst) 7 days ago #7 [ruby-core:123656]

Note to potential backporters: https://github.com/ruby/ruby/commit/bd51b20c50 should also be backported.
