Bug #16145

closed

regexp match error if mixing /i, character classes, and utf8

Added by zenspider (Ryan Davis) almost 6 years ago. Updated 5 months ago.

Status:

Closed

Assignee:

Target version:

ruby -v:

Backport:

2.5: UNKNOWN, 2.6: UNKNOWN

[ruby-core:94786]

Tags:

regexp

Description

(reported on behalf of mage@mage.gold -- there appears to be an error in registration or login):

See: ruby-talk @ X-Mail-Count: 440336

2.6.3 :049 > 'SHOP' =~ /[xo]/i
=> 2
2.6.3 :050 > 'CAFÉ' =~ /[é]/i
=> 3
2.6.3 :051 > 'CAFÉ' =~ /[xé]/i
=> nil
2.6.3 :052 > 'CAFÉ' =~ /[xÉ]/i
=> 3

Expected result:
2.6.3 :051 > 'CAFÉ' =~ /[xé]/i
=> 3
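
The irb session above can also be run as a standalone script. This is only a sketch: it writes the accented letters as `\u` escapes (`\u00E9` is é, `\u00C9` is É) so that the file encoding does not matter, and the constant names `CASES` and `RESULTS` are made up for illustration. The third case is the one reported to return nil; what it prints depends on the Ruby version in use.

```ruby
# Standalone version of the four irb cases from the report.
# "\u00C9" is "É" and "\u00E9" is "é"; escapes are used so the script
# does not depend on the encoding of the source file.
CASES = [
  ["SHOP",      /[xo]/i],
  ["CAF\u00C9", /[\u00E9]/i],   # single-character class
  ["CAF\u00C9", /[x\u00E9]/i],  # the reported failing case
  ["CAF\u00C9", /[x\u00C9]/i],  # same class, uppercase letter
].freeze

# Each entry: [subject, pattern source, match index or nil]
RESULTS = CASES.map { |subject, re| [subject, re.source, subject =~ re] }

RESULTS.each do |subject, source, index|
  puts "#{subject.inspect} =~ /#{source}/i  # => #{index.inspect}"
end
```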

I tested it on several online regex testers.

It does not match on https://regex101.com/

It matches on:

https://regexr.com/
https://www.regextester.com/
https://www.freeformatter.com/regex-tester.html

(Ignore case turned on).

I think this is a bug rather than a feature because /[é]/i matches 'CAFÉ'. If //i did not work for non-ASCII UTF-8 characters, then /[é]/i would not match it either. For comparison, [é] does not match 'CAFÉ' on https://regex101.com/

I could not find a page or a system that behaves the same way as Ruby does. For example, it matches in PostgreSQL 10 (under FreeBSD 12) too:

select 'CAFÉ' ~ '[xé]';
 ?column?
 f
(1 row)

select 'CAFÉ' ~* '[xé]';
 ?column?
 t
(1 row)

Tested it in IRB on macOS and FreeBSD.

$ uname -a && ruby -v && locale
Darwin xxx 18.7.0 Darwin Kernel Version 18.7.0: Thu Jun 20 18:42:21 PDT 2019; root:xnu-4903.270.47~4/RELEASE_X86_64 x86_64
ruby 2.6.3p62 (2019-04-16 revision 67580) [x86_64-darwin18]
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL="en_US.UTF-8"

$ uname -a && ruby -v && locale
FreeBSD xxx 12.0-RELEASE-p9 FreeBSD 12.0-RELEASE-p9 GENERIC amd64
ruby 2.6.3p62 (2019-04-16 revision 67580) [x86_64-freebsd12.0]
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=en_US.UTF-8

I installed Ruby with RVM.

Related issues 1 (0 open — 1 closed)

#1 [ruby-core:94794]

Updated by duerst (Martin Dürst) almost 6 years ago

Definitely a bug. Confirmed on master (ruby -v: ruby 2.7.0dev (2019-07-06T03:43:38Z trunk f296c260ef) [x86_64-cygwin]).

"CAFÉ" =~ /x|é/i
works, so that may serve as a workaround until this is fixed. It may also give some hints on where the bug comes from. My current guess is that single-character character classes get reduced to just the actual character, which would explain why they work.
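
As a rough sketch of that workaround, a character class can be replaced mechanically by a case-insensitive alternation. The helper name `any_char_ci` is hypothetical and not part of Ruby; it only relies on `Regexp.union` and `Regexp.new`, and it assumes the alternation form keeps working as reported above.

```ruby
# Hypothetical helper: build /a|b|c/i from the string "abc" instead of
# writing the character class /[abc]/i.
def any_char_ci(chars)
  # Regexp.union escapes each character and joins them with "|".
  Regexp.new(Regexp.union(chars.chars).source, Regexp::IGNORECASE)
end

PATTERN = any_char_ci("x\u00E9")   # same as /x|é/i
index = "CAF\u00C9" =~ PATTERN     # subject is "CAFÉ"
puts "#{PATTERN.inspect} => #{index.inspect}"
```

This trades the compact class syntax for an alternation, so it is only meant as a stopgap for short character sets.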

#2 [ruby-core:115374]

Updated by sancheta867 (Steven Ancheta) over 1 year ago

Confirmed that it's still happening on ruby 3.2.2 (2023-03-30 revision e51014f9c0) [arm64-darwin23]:

irb(main):001:0> 'CAFÉ' =~ /[xé]/i
=> nil
irb(main):002:0> "CAFÉ" =~ /x|é/i
=> 3

#3 [ruby-core:115537]

Updated by zenspider (Ryan Davis) over 1 year ago

@duerst (Martin Dürst) I don't think your intuition about the character classes is correct:

"CAFÉ" =~ /[a]/i
# => 1

#4 [ruby-core:115581]

Updated by duerst (Martin Dürst) over 1 year ago

@zenspider (Ryan Davis) I said that single-character character classes get reduced to just the actual character. So that would mean that your "CAFÉ" =~ /[a]/i gets reduced to "CAFÉ" =~ /a/i, and therefore works. That of course does not prove my guess, but it also doesn't disprove it. We'd need some other examples to test this further.
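
One possible way to collect such examples is a small probe that varies one aspect of the class at a time. This is a sketch, not a diagnosis: the probe labels and the constants `SUBJECT`, `LOWER`, `PROBES` and `PROBE_RESULTS` are invented here, and no outcome is claimed for the two-character variants, since those are exactly what is under test.

```ruby
SUBJECT = "CAF\u00C9"   # "CAFÉ"
LOWER   = "\u00E9"      # "é"

# Hypothetical probe set: each entry changes one thing about the pattern.
PROBES = {
  "single-character class" => "[#{LOWER}]",
  "two-character class"    => "[x#{LOWER}]",
  "reversed order"         => "[#{LOWER}x]",
  "alternation"            => "x|#{LOWER}",
}.freeze

# true/false per probe, all compiled with the /i flag.
PROBE_RESULTS = PROBES.transform_values do |source|
  Regexp.new(source, Regexp::IGNORECASE).match?(SUBJECT)
end

PROBE_RESULTS.each { |label, matched| puts "#{label.ljust(24)} #{matched}" }
```

If the single-character class and the alternation match while both two-character classes fail, that would be consistent with the guess, though it still would not prove it.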

#5 [ruby-core:120938]

Updated by mjrzasa (Maciek Rząsa) 6 months ago

I've tested it for Polish letters. The bug appears only for ó; all the others work OK:

pry(main)> ['ą', 'ę', 'ó', 'ś', 'ł', 'ć', 'ź', 'ż', 'ń'].map { [_1, _1.bytes, /[x#{_1}]/i.match?("qwer#{_1.capitalize}")]  }
=> [["ą", [196, 133], true],
 ["ę", [196, 153], true],
 ["ó", [195, 179], false],
 ["ś", [197, 155], true],
 ["ł", [197, 130], true],
 ["ć", [196, 135], true],
 ["ź", [197, 186], true],
 ["ż", [197, 188], true],
 ["ń", [197, 132], true]]

ó, like é, starts with the byte 195 (0xC3) in UTF-8.
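
To follow up on the lead-byte observation, the results can be grouped by the first UTF-8 byte of each letter. The sketch below reuses the same `/[x<letter>]/i` check as the pry session (with `upcase` instead of `capitalize`, which is equivalent for a single letter); the letter selection and the names `LETTERS` and `REPORT` are arbitrary choices for illustration, and the match column will depend on the Ruby version.

```ruby
# é ó ą ę ś ł, written as escapes to avoid source-encoding issues.
LETTERS = ["\u00E9", "\u00F3", "\u0105", "\u0119", "\u015B", "\u0142"].freeze

REPORT = LETTERS.map do |letter|
  {
    letter:    letter,
    lead_byte: letter.bytes.first,  # first byte of the UTF-8 encoding
    matched:   Regexp.new("[x#{letter}]", Regexp::IGNORECASE)
                     .match?("qwer#{letter.upcase}"),
  }
end

# Print one line per lead byte so any clustering of failures stands out.
REPORT.group_by { |row| row[:lead_byte] }.each do |lead, rows|
  summary = rows.map { |row| "#{row[:letter]}=#{row[:matched]}" }.join(" ")
  puts format("lead byte %d (0x%X): %s", lead, lead, summary)
end
```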