Project

General

Profile

Actions

Bug #14934

closed

Unicode: Hangul normalize bug

Added by MaLin (Lin Ma) over 5 years ago. Updated over 5 years ago.

Status:
Closed
Target version:
-
[ruby-core:88070]

Description

I was involved to fix a similar bug in Python, I found Ruby also has bug code.

We should fix this line[1] like this:
[1] https://github.com/ruby/ruby/blob/96db72ce38b27799dd8e80ca00696e41234db6ba/lib/unicode_normalize/normalize.rb#L73

-if length>2 and 0 <= (trail=string[2].ord-TBASE) and trail < TCOUNT
+if length>2 and 0 < (trail=string[2].ord-TBASE) and trail < TCOUNT


There was a change of Unicode Standard's demonstration code.

Before Unicode 4.1.0 (draft), here is: TBase <= code <= TBase+TCount
see: http://www.unicode.org/reports/tr15/tr15-24.html#hangul_composition

After Unicode 4.1.0, here is TBase < code < TBase+TCount, which in line with Unicode 10.0
see: http://www.unicode.org/reports/tr15/tr15-25.html#hangul_composition

This change happened in 2005.

Please note: The normalize algorithm didn't changed, only the demonstration code changed, see this discussion[2] about this point.
[2] https://bugs.python.org/issue29456


Here is some test code[3] for Python, maybe useful for this fix.
[3] https://github.com/python/cpython/commit/d134809cd3764c6a634eab7bb8995e3e2eff14d5

Updated by MaLin (Lin Ma) over 5 years ago

I'm not a expert of Ruby, I would suggest to examine the code of Hangul normalize, it looks too simple than the Unicode Standard's demonstration code.

Updated by duerst (Martin Dürst) over 5 years ago

  • Assignee set to duerst (Martin Dürst)

Updated by duerst (Martin Dürst) over 5 years ago

@MaLin (Lin Ma): Can you provide some test case(s)?

All the test data provided by Unicode (e.g. https://www.unicode.org/Public/10.0.0/ucd/NormalizationTest.txt) is used for testing. But I know that there are some corner cases in Hangul that are not covered by that.

I won't have much time to look at this issue this week. I'll get around to it next week (maybe even this Friday).

Updated by MaLin (Lin Ma) over 5 years ago

Can you provide some test case(s)?

That is what frustrated me. I simply translated Python's test-cases for this issue[1] to Ruby.
[1] https://github.com/python/cpython/commit/d134809cd3764c6a634eab7bb8995e3e2eff14d5

But them passed without rasing exception.

Ruby's code seems relatived to the \u11a7 character.

I won't have much time to look at this issue this week. I'll get around to it next week (maybe even this Friday).

Need not hurry, it's a very old bug, and passed test-cases mystically.

Updated by duerst (Martin Dürst) over 5 years ago

I have quickly looked at the tests. I understand what issue they are checking (NFC in case of a mix of current Hangul and old-style-only Hangul jamos). I think Ruby should pass these tests.

Ruby has tests for this case, see
https://svn.ruby-lang.org/cgi-bin/viewvc.cgi/trunk/test/test_unicode_normalize.rb?revision=55567&view=markup#l159 and the following 6 lines. But the individual code point values are different from your python tests. I'll have a very close look at the code itself hopefully on Friday afternoon.

Updated by duerst (Martin Dürst) over 5 years ago

I think I have figured things out:

The patch is technically correct. While LBASE and VBASE are the values of the first actual leading and vowel jamos, the value of TBASE is one smaller than the first actual trailing jamo at 0x11A8. This is to account for the fact that the lowest value of the "trailing digit" of the Hangul syllable representation indicates the absence of a trailing jamo. So in contrast to the <= tests related to LBASE and VBASE, it is indeed technically correct to have a < comparison operator in the comparison related to TBASE.

However, I have also figured out why this apparent bug doesn't actually affect Ruby. The reason is that we use regular expressions to extract "normalization runs" from the string to be normalized. We know that a U+11A7 character can never participate in a normalization operation because it is a classical Hangul Jamo not used in modern Hangul. So U+11A7 never appears in a normalization run, and there's thus no error.

Updated by MaLin (Lin Ma) over 5 years ago

Get it. :)

Updated by duerst (Martin Dürst) over 5 years ago

I committed the tests adapted from Python and the fix of the comparison operator, because it's technically correct and we never know when this would lead to an actual bug if something somewhere else in the code gets changed.

@MaLin (Lin Ma), thanks again for the report, this helped me find another (real!) bug with file names fixed at r64085, and make an improvement at r64086.

Updated by duerst (Martin Dürst) over 5 years ago

  • Status changed from Open to Closed
  • Backport changed from 2.3: UNKNOWN, 2.4: UNKNOWN, 2.5: UNKNOWN to 2.3: DONTNEED, 2.4: DONTNEED, 2.5: DONTNEED

Closed. Because there is no actual bug, there is no need to backport this.

Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0Like0Like0Like0Like0Like0