Bug #21162

Regexp casefold mismatch for latin1 supplemental chars

Added by headius (Charles Nutter) 7 days ago. Updated 3 days ago.

Status:
Open
Assignee:
-
Target version:
-
[ruby-core:121201]

Description

Originally reported to joni repo with a possible fix here: https://github.com/jruby/joni/pull/20

From that PR:

When a character's code point is less than or equal to the single-byte
maximum (0xff), yet it takes more than 1 byte in the current encoding,
the case folding code incorrectly puts it in the bitset instead of the
code range. As a result, for the UTF-8 encoding, casefold works incorrectly
on characters in the range \u0080 to \u00ff (Latin-1 Supplement).
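
The condition described in the PR can be observed from plain Ruby: U+00C2 is no larger than 0xff as a code point, yet it occupies two bytes in UTF-8. A minimal sketch follows; `byte_widths` is a hypothetical helper written for this illustration and is not part of joni or CRuby.

```ruby
# Hypothetical helper: reports the code point of a single character and
# how many bytes it occupies in UTF-8 versus ISO-8859-1.
def byte_widths(ch)
  {
    code_point: ch.ord,                          # numeric code point
    utf8: ch.encode("UTF-8").bytesize,           # width in UTF-8
    latin1: ch.encode("ISO-8859-1").bytesize     # width in Latin-1
  }
end

# LATIN CAPITAL LETTER A WITH CIRCUMFLEX, the subject string used in this report.
p byte_widths("\u00c2")
```

A character like this is the case the PR describes: its value fits in one byte, but the encoded form does not.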

Before fix:

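"\u00c2" matched case-insensitively against [\u00e0-\u00e5] returns false
"\u00c2" matched case-insensitively against [\u00e2] returns false
"\u00c2" matched case-insensitively against \u00e2 returns true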

As a Ruby example:

$ ruby -v -e 'p(/[\u00e0-\u00e5]/i =~ "\u00c2")'
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
nil

The PR above was rebased in https://github.com/jruby/joni/pull/85. When that patch is incorporated into JRuby, it behaves as expected:

$ ruby -v -e 'p(/[\u00e0-\u00e5]/i =~ "\u00c2")'
jruby 10.0.0.0-SNAPSHOT (3.4.0) 2025-02-27 c0e5008419 OpenJDK 64-Bit Server VM 21.0.5+11-LTS on 21.0.5+11-LTS +indy +jit [arm64-darwin]
0
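
The three cases listed from the PR can be checked in a single run. This is only a sketch: `casefold_cases` is a hypothetical helper name, and the results depend on the Ruby implementation and version it runs on.

```ruby
# Hypothetical helper: runs the three case-insensitive matches from the
# joni PR against the given subject and returns whether each one matched.
def casefold_cases(subject = "\u00c2")
  {
    "range class"       => /[\u00e0-\u00e5]/i.match?(subject),
    "single-char class" => /[\u00e2]/i.match?(subject),
    "literal"           => /\u00e2/i.match?(subject)
  }
end

casefold_cases.each { |name, matched| puts "#{name}: #{matched}" }
```

Going by the results above, an implementation with the fix should report a match for all three cases, while an affected one matches only the literal.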

This bug may affect other casefold situations. As I am not very familiar with this code in CRuby (other than from maintaining the port in joni), I would like others to evaluate this fix and help find other places where it is needed.
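
One way to look for other affected characters is to compare a case-insensitive literal match with the equivalent single-character class for every code point in the range. The sketch below assumes that a disagreement between the two indicates this bug; `casefold_mismatches` is a hypothetical helper, and it only covers characters whose `swapcase` is a single different character.

```ruby
# Hypothetical helper: returns the code points in the range for which a
# case-insensitive literal and a case-insensitive character class disagree.
def casefold_mismatches(range = 0x80..0xff)
  range.filter_map do |cp|
    ch = [cp].pack("U")          # UTF-8 string for this code point
    other = ch.swapcase          # the opposite-case form, if there is one
    # Skip caseless characters and multi-character mappings such as U+00DF.
    next if other == ch || other.length != 1

    pattern = Regexp.escape(other)
    literal = Regexp.new(pattern, Regexp::IGNORECASE).match?(ch)
    klass   = Regexp.new("[#{pattern}]", Regexp::IGNORECASE).match?(ch)
    format("U+%04X", cp) if literal != klass
  end
end

puts casefold_mismatches.join(" ")
```

The same helper can be pointed at another range, for example `casefold_mismatches(0x100..0x17f)`, to check whether the problem is limited to code points up to 0xff.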

Updated by nobu (Nobuyoshi Nakada) 6 days ago

Sounds like the same as #16145.

Updated by headius (Charles Nutter) 3 days ago

@nobu (Nobuyoshi Nakada) It certainly could be, and the fix looks similar.

@mjrzasa What do you think? Does your fix repair this problem?
