Bug #21162
Open
Regexp casefold mismatch for latin1 supplemental chars
Description
Originally reported to joni repo with a possible fix here: https://github.com/jruby/joni/pull/20
From that PR:
When a character is less than or equal to single byte size (0xff),
yet it takes more than 1 byte in the current encoding, the
case folding code incorrectly put it in bitset instead of code
range. As a result, for utf8 encoding, casefold works incorrectly
on characters in range \u0080 to \u00ff (latin1 supplement).
Before fix:
"\u00c2" [\u00e0-\u00e5] returns false
"\u00c2" [\u00e2] returns false
"\u00c2" \u00e2 returns true
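The condition described in the PR (code point at most 0xff, but more than one byte in the current encoding) can be checked from plain Ruby. The following is only a sketch of that condition, not the joni or Onigmo code; the helper name small_code_but_multibyte? is made up for illustration.

```ruby
# Code points of the Latin-1 Supplement block (plus the C1 controls below it).
LATIN1_SUPPLEMENT = (0x80..0xff)

# Hypothetical helper: true when a code point fits in a single-byte value
# (<= 0xff) but needs more than one byte in the given encoding. This is the
# case the PR says was wrongly placed in the bitset instead of a code range.
def small_code_but_multibyte?(codepoint, enc = Encoding::UTF_8)
  codepoint <= 0xff && codepoint.chr(enc).bytesize > 1
end

affected = LATIN1_SUPPLEMENT.select { |cp| small_code_but_multibyte?(cp) }

puts affected.size                       # every code point in 0x80..0xff qualifies in UTF-8
puts small_code_but_multibyte?(0x41)     # ASCII "A" is a single byte in UTF-8
puts small_code_but_multibyte?(0xe2)     # "\u00e2" takes two bytes in UTF-8
```

In UTF-8 every code point from \u0080 to \u00ff meets this condition, which matches the range named in the PR description.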
As a Ruby example:
$ ruby -v -e 'p(/[\u00e0-\u00e5]/i =~ "\u00c2")'
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
nil
The PR above was rebased in https://github.com/jruby/joni/pull/85. When that patch is incorporated into JRuby, it behaves as expected:
$ ruby -v -e 'p(/[\u00e0-\u00e5]/i =~ "\u00c2")'
jruby 10.0.0.0-SNAPSHOT (3.4.0) 2025-02-27 c0e5008419 OpenJDK 64-Bit Server VM 21.0.5+11-LTS on 21.0.5+11-LTS +indy +jit [arm64-darwin]
0
This bug may affect other casefold situations. As I am not very familiar with this code in CRuby (other than from maintaining the port in joni), I would like others to evaluate this fix and help find other places where it is needed.
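To help look for other affected characters, here is a rough sketch that compares a literal case-insensitive match against a single-element character class for each cased character in \u0080 to \u00ff. The method name casefold_class_mismatches is hypothetical, and using String#swapcase to find the case partner is an assumption of this sketch, not something taken from the joni patch.

```ruby
# Hypothetical helper: for each character in the range that has a
# single-character case partner, compare /x/i against /[x]/i when matched
# with that partner. Returns the [char, partner] pairs where the two disagree.
def casefold_class_mismatches(range = 0x80..0xff)
  range.filter_map do |cp|
    ch = cp.chr(Encoding::UTF_8)
    other = ch.swapcase
    # Skip uncased characters and multi-character mappings such as "\u00df" -> "SS".
    next if other == ch || other.length != 1

    literal = Regexp.new(Regexp.escape(ch), Regexp::IGNORECASE).match?(other)
    klass   = Regexp.new("[#{Regexp.escape(ch)}]", Regexp::IGNORECASE).match?(other)
    [ch, other] if literal != klass
  end
end

casefold_class_mismatches.each do |ch, other|
  printf("U+%04X vs U+%04X: literal and character class disagree\n", ch.ord, other.ord)
end
```

An empty result means the literal and character-class forms agree for every pair tested. It does not cover ranges such as [\u00e0-\u00e5] or other casefold paths.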
Updated by nobu (Nobuyoshi Nakada) 6 days ago
Sounds like same as #16145.
Updated by headius (Charles Nutter) 3 days ago
@nobu (Nobuyoshi Nakada) It certainly could be, and the fix looks similar.
@mjrzasa What do you think? Does your fix repair this problem?