Bug #16145

regexp match error if mixing /i, character classes, and utf8

Added by zenspider (Ryan Davis) over 5 years ago. Updated 1 day ago.

Status:
Open
Assignee:
-
Target version:
-
[ruby-core:94786]

Description

(reported on behalf of -- there appears to be an error in registration or login):

See: ruby-talk @ X-Mail-Count: 440336

2.6.3 :049 > 'SHOP' =~ /[xo]/i
=> 2
2.6.3 :050 > 'CAFÉ' =~ /[é]/i
=> 3
2.6.3 :051 > 'CAFÉ' =~ /[xé]/i
=> nil
2.6.3 :052 > 'CAFÉ' =~ /[xÉ]/i
=> 3

Expected result:
2.6.3 :051 > 'CAFÉ' =~ /[xé]/i
=> 3

I tested it on several online regex testers.

It does not match on https://regex101.com/

It matches on:

https://regexr.com/
https://www.regextester.com/
https://www.freeformatter.com/regex-tester.html

(Ignore case turned on).

The reason I suppose it’s more like a bug than a feature is the fact that /[é]/i matches 'CAFÉ'. If the //i didn’t work for UTF-8 characters then the /[é]/i wouldn’t match it either. For example, [é] does not match 'CAFÉ' on https://regex101.com/

I could not find a page or a system that behaves the same way as Ruby does. For example, it matches in PostgreSQL 10 (under FreeBSD 12) too:

select 'CAFÉ' ~ '[xé]';
 ?column?
----------
 f
(1 row)

select 'CAFÉ' ~* '[xé]';
 ?column?
----------
 t
(1 row)

Tested it in IRB on macOS and FreeBSD.

$ uname -a && ruby -v && locale
Darwin xxx 18.7.0 Darwin Kernel Version 18.7.0: Thu Jun 20 18:42:21 PDT 2019; root:xnu-4903.270.47~4/RELEASE_X86_64 x86_64
ruby 2.6.3p62 (2019-04-16 revision 67580) [x86_64-darwin18]
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL="en_US.UTF-8"

$ uname -a && ruby -v && locale
FreeBSD xxx 12.0-RELEASE-p9 FreeBSD 12.0-RELEASE-p9 GENERIC amd64
ruby 2.6.3p62 (2019-04-16 revision 67580) [x86_64-freebsd12.0]
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=en_US.UTF-8

I installed Ruby with RVM.

Updated by duerst (Martin Dürst) over 5 years ago

Definitely a bug. Confirmed on master (ruby -v: ruby 2.7.0dev (2019-07-06T03:43:38Z trunk f296c260ef) [x86_64-cygwin]).

"CAFÉ" =~ /x|é/i
works, so that may serve as a workaround until this is fixed. It may also give some hints on where the bug comes from. My current guess is that single-character character classes get reduced to just the actual character, which would explain why they work.
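
The alternation form can also be generated from a list of characters instead of being written by hand. This is a minimal sketch, assuming a hypothetical helper named `case_insensitive_any` (not part of Ruby); it uses only `Regexp.union` and `Regexp.new`:

```ruby
# Hypothetical helper: builds /a|b|c/i from a list of characters,
# avoiding the [abc] character-class form discussed in this report.
def case_insensitive_any(chars)
  # Regexp.union escapes each string and joins them with "|".
  alternation = Regexp.union(chars).source
  Regexp.new(alternation, Regexp::IGNORECASE)
end

pattern = case_insensitive_any(%w[x é])
p pattern            # the generated alternation
p 'CAFÉ' =~ pattern  # index of the first match, or nil
```

This only replaces simple classes made of literal characters; ranges such as [a-z] would need separate handling.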

Updated by sancheta867 (Steven Ancheta) over 1 year ago

Confirmed that it's still happening on ruby 3.2.2 (2023-03-30 revision e51014f9c0) [arm64-darwin23]:

irb(main):001:0> 'CAFÉ' =~ /[xé]/i
=> nil
irb(main):002:0> "CAFÉ" =~ /x|é/i
=> 3

Updated by zenspider (Ryan Davis) about 1 year ago

@duerst (Martin Dürst) I don't think your intuition about the character classes is correct:

"CAFÉ" =~ /[a]/i
# => 1

Updated by duerst (Martin Dürst) about 1 year ago

@zenspider (Ryan Davis) I said that single-character character classes get reduced to just the actual character. So that would mean that your "CAFÉ" =~ /[a]/i gets reduced to "CAFÉ" =~ /a/i, and therefore works. That of course does not prove my guess, but it also doesn't disprove it. We'd need some other examples to test this further.
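
One way to collect further examples is to compare several spellings of the same pattern side by side. The sketch below is only an illustration: `probe` is a hypothetical helper, the filler letter x and the subject string are arbitrary choices, and the results depend on the Ruby version, so none are hard-coded.

```ruby
# Hypothetical helper: matches the uppercase form of `letter` against
# four case-insensitive spellings of "this letter or the filler".
def probe(letter, filler = 'x')
  subject = "qwer#{letter.upcase}"
  l = Regexp.escape(letter)
  f = Regexp.escape(filler)
  {
    single_class:   Regexp.new("[#{l}]", Regexp::IGNORECASE).match?(subject),
    multi_class:    Regexp.new("[#{f}#{l}]", Regexp::IGNORECASE).match?(subject),
    reversed_class: Regexp.new("[#{l}#{f}]", Regexp::IGNORECASE).match?(subject),
    alternation:    Regexp.new("#{f}|#{l}", Regexp::IGNORECASE).match?(subject)
  }
end

%w[a é ó ą].each { |letter| puts "#{letter}: #{probe(letter)}" }
```

If the guess is right, single_class and alternation should agree for every letter, while multi_class and reversed_class may differ for the affected ones.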

Updated by mjrzasa (Maciek Rząsa) 10 days ago

I've tested it for Polish letters. The bug appears only for ó; all others work OK:

pry(main)> ['ą', 'ę', 'ó', 'ś', 'ł', 'ć', 'ź', 'ż', 'ń'].map { [_1, _1.bytes, /[x#{_1}]/i.match?("qwer#{_1.capitalize}")]  }
=> [["ą", [196, 133], true],
 ["ę", [196, 153], true],
 ["ó", [195, 179], false],
 ["ś", [197, 155], true],
 ["ł", [197, 130], true],
 ["ć", [196, 135], true],
 ["ź", [197, 186], true],
 ["ż", [197, 188], true],
 ["ń", [197, 132], true]]

ó, like é, starts with byte 195:

pry(main)> 'é'.bytes
=> [195, 169]
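
To check whether the lead byte is really the common factor, the same test can be run over a wider set of letters and the failures grouped by their first UTF-8 byte. This is a rough sketch, not a diagnosis: `affected_letters` is a hypothetical helper, and the choice of U+00E0..U+00FE (which all encode with lead byte 195) plus the Polish letters above is an assumption about where to look.

```ruby
# Hypothetical helper: returns the letters whose uppercase form is NOT
# matched by /[x<letter>]/i although /x|<letter>/i does match it.
def affected_letters(letters, filler = 'x')
  letters.select do |letter|
    upper = letter.upcase
    next false if upper == letter # no distinct uppercase form to test
    l = Regexp.escape(letter)
    class_hit = Regexp.new("[#{filler}#{l}]", Regexp::IGNORECASE).match?(upper)
    alt_hit   = Regexp.new("#{filler}|#{l}", Regexp::IGNORECASE).match?(upper)
    alt_hit && !class_hit
  end
end

# U+00E0..U+00FE: lowercase Latin-1 letters (plus the division sign).
latin1_lower = (0xE0..0xFE).map { |cp| [cp].pack('U') }
polish       = %w[ą ę ó ś ł ć ź ż ń]

candidates = (latin1_lower + polish).uniq
affected_letters(candidates).group_by { |l| l.bytes.first }.each do |lead, group|
  puts "lead byte #{lead}: #{group.join(' ')}"
end
```

On a Ruby version without the bug the script prints nothing.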