Bug #21162: Regexp casefold mismatch for latin1 supplemental chars - Ruby - Ruby Issue Tracking System

Added by headius (Charles Nutter) about 1 year ago. Updated about 1 year ago.

Status: Closed
Assignee:
Target version:
ruby -v:
Backport: 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN

[ruby-core:121201]

Description

Originally reported to the joni repo, with a possible fix: https://github.com/jruby/joni/pull/20

From that PR:

When a character's code point is less than or equal to the single-byte
maximum (0xff), yet it takes more than 1 byte in the current encoding, the
case folding code incorrectly puts it in the bitset instead of the code
range. As a result, for UTF-8 encoding, casefold works incorrectly
on characters in the range \u0080 to \u00ff (Latin-1 Supplement).

Before fix (string, case-insensitive pattern, match result):

"\u00c2" [\u00e0-\u00e5] returns false
"\u00c2" [\u00e2] returns false
"\u00c2" \u00e2 returns true
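
The precondition described in the PR text can be checked from Ruby without touching the regexp engine. The sketch below is illustrative only: it shows that a code point in the range above fits numerically in one byte but occupies two bytes in UTF-8, while ISO-8859-1 stores it in one. The helper name `byte_widths` is made up for this example and does not appear in joni or CRuby.

```ruby
# Illustrative helper (hypothetical name): report how many bytes a code point
# occupies in UTF-8 and in ISO-8859-1.
def byte_widths(codepoint)
  utf8 = [codepoint].pack("U")        # UTF-8 string holding one character
  latin1 = utf8.encode("ISO-8859-1")  # same character transcoded to Latin-1
  { utf8: utf8.bytesize, latin1: latin1.bytesize }
end

p byte_widths(0x00c2)  # U+00C2, the string used in the report
p byte_widths(0x00e2)  # U+00E2, its lowercase counterpart
p byte_widths(0x0041)  # ASCII "A", for comparison
```

Both Latin-1 Supplement characters are below 0xff yet multibyte in UTF-8, which is the combination the PR says was mishandled.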

As a Ruby example:

$ ruby -v -e 'p(/[\u00e0-\u00e5]/i =~ "\u00c2")'
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
nil

The PR above was rebased in https://github.com/jruby/joni/pull/85. When that patch is incorporated into JRuby, it behaves as expected:

$ ruby -v -e 'p(/[\u00e0-\u00e5]/i =~ "\u00c2")'
jruby 10.0.0.0-SNAPSHOT (3.4.0) 2025-02-27 c0e5008419 OpenJDK 64-Bit Server VM 21.0.5+11-LTS on 21.0.5+11-LTS +indy +jit [arm64-darwin]
0
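
For comparing interpreters side by side, the three cases from the PR text can be collected into one small script. This is only a sketch; `casefold_results` is a hypothetical helper name, and the results for the two character-class forms depend on whether the interpreter carries the fix, so no expected output is shown.

```ruby
# Hypothetical helper: try three case-insensitive patterns against one string
# and return the raw =~ results (an Integer index on match, nil otherwise).
def casefold_results(subject)
  {
    range_class:  /[\u00e0-\u00e5]/i =~ subject,  # form reported as failing
    single_class: /[\u00e2]/i =~ subject,         # form reported as failing
    literal:      /\u00e2/i =~ subject            # form reported as working
  }
end

p casefold_results("\u00c2")
```

On an affected build the two character-class entries should come back as nil while the literal entry is 0; with the fix applied all three should be 0.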

This bug may affect other casefold situations. As I am not very familiar with this code in CRuby (other than from maintaining the port in joni) I would like others to evaluate this fix and help find other places it is needed.
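
One rough way to look for other affected characters is to sweep the uppercase letters of the Latin-1 Supplement and flag any whose lowercase form matches as a literal but not through a one-element character class. The sketch below assumes that disagreement between those two forms is a fair signal of this bug; it only probes the Ruby-level behaviour, not the engine code, and `class_vs_literal_mismatches` is a hypothetical name.

```ruby
# Sketch: for each code point in the range, compare a case-insensitive literal
# match of its lowercase form with the same match through a one-element
# character class. Code points where the two disagree are returned.
def class_vs_literal_mismatches(range = 0x00c0..0x00de)
  range.filter_map do |cp|
    upper = [cp].pack("U")
    lower = upper.downcase
    next if lower == upper  # no case pair (e.g. U+00D7, the multiplication sign)
    literal = Regexp.new(Regexp.escape(lower), Regexp::IGNORECASE)
    klass   = Regexp.new("[#{Regexp.escape(lower)}]", Regexp::IGNORECASE)
    format("U+%04X", cp) if literal.match?(upper) != klass.match?(upper)
  end
end

p class_vs_literal_mismatches
```

An empty array means the literal and character-class forms agree for every letter in the range; any entries are candidates for closer inspection.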

Related issues 1 (0 open — 1 closed)

Updated by nobu (Nobuyoshi Nakada) about 1 year ago
#1 [ruby-core:121204]

Sounds like the same as #16145.

Updated by headius (Charles Nutter) about 1 year ago
#2 [ruby-core:121225]

@nobu (Nobuyoshi Nakada) Certainly could be, and the fix looks similar.

@mjrzasa (Maciek Rząsa) What do you think? Does your fix repair this problem?

Updated by nobu (Nobuyoshi Nakada) about 1 year ago
#3

Is duplicate of Bug #16145: regexp match error if mixing /i, character classes, and utf8 added

Updated by nobu (Nobuyoshi Nakada) about 1 year ago
#4 [ruby-core:121255]

Status changed from Open to Closed
