Project

General

Profile

Bug #14934

Unicode: Hangul normalize bug

Added by MaLin (Lin Ma) about 1 year ago. Updated about 1 year ago.

Status:
Closed
Priority:
Normal
Target version:
-
[ruby-core:88070]

Description

I was involved to fix a similar bug in Python, I found Ruby also has bug code.

We should fix this line[1] like this:
[1] https://github.com/ruby/ruby/blob/96db72ce38b27799dd8e80ca00696e41234db6ba/lib/unicode_normalize/normalize.rb#L73

-if length>2 and 0 <= (trail=string[2].ord-TBASE) and trail < TCOUNT
+if length>2 and 0 < (trail=string[2].ord-TBASE) and trail < TCOUNT


There was a change of Unicode Standard's demonstration code.

Before Unicode 4.1.0 (draft), here is: TBase <= code <= TBase+TCount
see: http://www.unicode.org/reports/tr15/tr15-24.html#hangul_composition

After Unicode 4.1.0, here is TBase < code < TBase+TCount, which in line with Unicode 10.0
see: http://www.unicode.org/reports/tr15/tr15-25.html#hangul_composition

This change happened in 2005.

Please note: The normalize algorithm didn't changed, only the demonstration code changed, see this discussion[2] about this point.
[2] https://bugs.python.org/issue29456


Here is some test code[3] for Python, maybe useful for this fix.
[3] https://github.com/python/cpython/commit/d134809cd3764c6a634eab7bb8995e3e2eff14d5

Associated revisions

Revision 64087
Added by duerst (Martin Dürst) about 1 year ago

fix range check for Hangul jamo trailers in Unicode normalization

  • lib/unicode_normalize/normalize.rb: Fix the range check for trailing
    Hangul jamo characters in Unicode normalization. Different from
    leading or vowel jamos, where LBASE and VBASE are actual characters,
    a value equal to TBASE expresses the absence of a trailing jamo.
    This fix is technically correct, but there was no bug because
    the regular expressions in lib/unicode_normalize/tables.rb
    eliminate jamos equal to TBASE from normalization processing.

  • test/test_unicode_normalize.rb: Add preventive test
    test_no_trailing_jamo based on
    https://github.com/python/cpython/commit/d134809cd3764c6a634eab7bb8995e3e2eff14d5
    just for the case we ever get a regression.

This closes issue #14934, thanks to MaLin (Lin Ma) for reporting.

Revision 64087
Added by duerst (Martin Dürst) about 1 year ago

fix range check for Hangul jamo trailers in Unicode normalization

  • lib/unicode_normalize/normalize.rb: Fix the range check for trailing
    Hangul jamo characters in Unicode normalization. Different from
    leading or vowel jamos, where LBASE and VBASE are actual characters,
    a value equal to TBASE expresses the absence of a trailing jamo.
    This fix is technically correct, but there was no bug because
    the regular expressions in lib/unicode_normalize/tables.rb
    eliminate jamos equal to TBASE from normalization processing.

  • test/test_unicode_normalize.rb: Add preventive test
    test_no_trailing_jamo based on
    https://github.com/python/cpython/commit/d134809cd3764c6a634eab7bb8995e3e2eff14d5
    just for the case we ever get a regression.

This closes issue #14934, thanks to MaLin (Lin Ma) for reporting.

History

Updated by MaLin (Lin Ma) about 1 year ago

I'm not a expert of Ruby, I would suggest to examine the code of Hangul normalize, it looks too simple than the Unicode Standard's demonstration code.

Updated by duerst (Martin Dürst) about 1 year ago

  • Assignee set to duerst (Martin Dürst)

Updated by duerst (Martin Dürst) about 1 year ago

MaLin (Lin Ma): Can you provide some test case(s)?

All the test data provided by Unicode (e.g. https://www.unicode.org/Public/10.0.0/ucd/NormalizationTest.txt) is used for testing. But I know that there are some corner cases in Hangul that are not covered by that.

I won't have much time to look at this issue this week. I'll get around to it next week (maybe even this Friday).

Updated by MaLin (Lin Ma) about 1 year ago

Can you provide some test case(s)?

That is what frustrated me. I simply translated Python's test-cases for this issue[1] to Ruby.
[1] https://github.com/python/cpython/commit/d134809cd3764c6a634eab7bb8995e3e2eff14d5

But them passed without rasing exception.

Ruby's code seems relatived to the \u11a7 character.

I won't have much time to look at this issue this week. I'll get around to it next week (maybe even this Friday).

Need not hurry, it's a very old bug, and passed test-cases mystically.

Updated by duerst (Martin Dürst) about 1 year ago

I have quickly looked at the tests. I understand what issue they are checking (NFC in case of a mix of current Hangul and old-style-only Hangul jamos). I think Ruby should pass these tests.

Ruby has tests for this case, see
https://svn.ruby-lang.org/cgi-bin/viewvc.cgi/trunk/test/test_unicode_normalize.rb?revision=55567&view=markup#l159 and the following 6 lines. But the individual code point values are different from your python tests. I'll have a very close look at the code itself hopefully on Friday afternoon.

Updated by duerst (Martin Dürst) about 1 year ago

I think I have figured things out:

The patch is technically correct. While LBASE and VBASE are the values of the first actual leading and vowel jamos, the value of TBASE is one smaller than the first actual trailing jamo at 0x11A8. This is to account for the fact that the lowest value of the "trailing digit" of the Hangul syllable representation indicates the absence of a trailing jamo. So in contrast to the <= tests related to LBASE and VBASE, it is indeed technically correct to have a < comparison operator in the comparison related to TBASE.

However, I have also figured out why this apparent bug doesn't actually affect Ruby. The reason is that we use regular expressions to extract "normalization runs" from the string to be normalized. We know that a U+11A7 character can never participate in a normalization operation because it is a classical Hangul Jamo not used in modern Hangul. So U+11A7 never appears in a normalization run, and there's thus no error.

Updated by MaLin (Lin Ma) about 1 year ago

Get it. :)

Updated by duerst (Martin Dürst) about 1 year ago

I committed the tests adapted from Python and the fix of the comparison operator, because it's technically correct and we never know when this would lead to an actual bug if something somewhere else in the code gets changed.

MaLin (Lin Ma), thanks again for the report, this helped me find another (real!) bug with file names fixed at r64085, and make an improvement at r64086.

Updated by duerst (Martin Dürst) about 1 year ago

  • Backport changed from 2.3: UNKNOWN, 2.4: UNKNOWN, 2.5: UNKNOWN to 2.3: DONTNEED, 2.4: DONTNEED, 2.5: DONTNEED
  • Status changed from Open to Closed

Closed. Because there is no actual bug, there is no need to backport this.

Also available in: Atom PDF