Bug #13671
Regexp with lookbehind and case-insensitivity raises RegexpError only on strings with certain characters
Description
Here is a test program:
def test(description)
  begin
    yield
    puts "#{description} is OK"
  rescue RegexpError
    puts "#{description} raises RegexpError"
  end
end
test("ass, case-insensitive, special") { /(?<!ass)/i =~ '✨' }
test("bss, case-insensitive, special") { /(?<!bss)/i =~ '✨' }
test("as, case-insensitive, special") { /(?<!as)/i =~ '✨' }
test("ss, case-insensitive, special") { /(?<!ss)/i =~ '✨' }
test("ass, case-sensitive, special") { /(?<!ass)/ =~ '✨' }
test("ass, case-insensitive, regular") { /(?<!ass)/i =~ 'x' }
Running the test program with Ruby 2.4.1 (macOS) gives
ass, case-insensitive, special raises RegexpError
bss, case-insensitive, special raises RegexpError
as, case-insensitive, special is OK
ss, case-insensitive, special is OK
ass, case-sensitive, special is OK
ass, case-insensitive, regular is OK
The RegexpError is "invalid pattern in look-behind: /(?<!ass)/i (RegexpError)"
Side note: in the real code in which I found this error I was able to work around the error by using (?i) after the lookbehind instead of //i.
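A minimal sketch of that workaround, assuming a hypothetical pattern (the real code is not shown in this report). Moving the case-insensitivity into an inline (?i) after the look-behind leaves the look-behind itself case-sensitive. Note that this changes the meaning slightly: an upper-case "ASS" before the word no longer blocks the match.

```ruby
# Hypothetical pattern, for illustration only.
# The look-behind stays case-sensitive; only "word" after (?i) ignores case.
NOT_AFTER_ASS = Regexp.new('(?<!ass)(?i)word')

# Returns the character index of the match, or nil if there is none.
def word_index(text)
  NOT_AFTER_ASS =~ text
end

p word_index("\u2728 WORD")  # non-ASCII subject, upper-case word
p word_index("assword")      # the look-behind rejects the only candidate position
p word_index("ASSword")      # the look-behind is case-sensitive, so "ASS" does not block
```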
Running the test program with Ruby 2.3.4 does not report any RegexpErrors.
I think this is a regression, although I might be wrong and it might be saving me from an incorrect result with certain strings.
Updated by Hanmac (Hans Mackowiak) over 3 years ago
I ran some checks on my Windows system to see how deep the problem goes.
I used "ä" as the test string.
The same problem happens when you use the match method too:
/(?<!ass)/i.match('ä')
It also happens for
Regexp.union(/(?<!ass)/i, /ä/)
but I still don't understand why it crashes with ass while ss works.
It might have something to do with how regexps are stored internally.
Updated by naruse (Yui NARUSE) over 3 years ago
I created a ticket in upstream: https://github.com/k-takata/Onigmo/issues/92
Updated by gotoken (Kentaro Goto) over 2 years ago
I encountered a non-ss case. Is this the same problem?
% ruby -ve '"".match(/(?<=ast)/ui)'
ruby 2.6.0dev (2018-08-27 trunk 64549) [x86_64-linux]
-e:1: invalid pattern in look-behind: /(?<=ast)/i
It was reproduced in version 2.4 and 2.5.
#14838 seems to be a duplicate.
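To check which look-behind bodies are affected on a given Ruby, a small probe along these lines may help. It is only a sketch: the helper name probe_lookbehind and the sample subject string are assumptions, and the result depends on the Ruby and Onigmo version in use.

```ruby
# Returns :ok or :error for a look-behind body on the running Ruby.
# Helper name and sample subject are illustrative assumptions.
def probe_lookbehind(body, ignore_case)
  options = ignore_case ? Regexp::IGNORECASE : 0
  re = Regexp.new("(?<=#{body})", options)
  re =~ "\u2728"  # non-ASCII subject, as in the reports above
  :ok
rescue RegexpError
  :error
end

puts "Ruby #{RUBY_VERSION}"
%w[ss ass ast as].each do |body|
  with_i = probe_lookbehind(body, true)
  plain  = probe_lookbehind(body, false)
  puts "#{body}: case-insensitive => #{with_i}, case-sensitive => #{plain}"
end
```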
Updated by znz (Kazuhiro NISHIYAMA) over 2 years ago
You can use (?:s) instead of s as a workaround.
$ ruby -ve '/(?<=ast)/iu'
ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-darwin17]
-e:1: invalid pattern in look-behind: /(?<=ast)/i
-e:1: warning: possibly useless use of a literal in void context
$ ruby -ve '/(?<=a(?:s)t)/iu'
ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-darwin17]
-e:1: warning: possibly useless use of a literal in void context
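The same workaround used for an actual match might look like the sketch below. The trailing x and the sample strings are assumptions added for illustration; only the a(?:s)t look-behind body comes from the comment above.

```ruby
# Wrap the "s" in a non-capturing group so that "st" is not one literal run.
# The trailing "x" and the sample strings are illustrative assumptions.
AFTER_AST = Regexp.new('(?<=a(?:s)t)x', Regexp::IGNORECASE)

# Returns the character index of an "x" preceded by "ast" (any case), or nil.
def index_after_ast(text)
  AFTER_AST =~ text
end

p index_after_ast("\u2728ASTx")
p index_after_ast("\u2728abcx")
```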
Updated by znz (Kazuhiro NISHIYAMA) over 2 years ago
- Related to Bug #14838: RegexpError with double "s" in look-behind assertion in case-insensitive unicode regexp added
Updated by gotoken (Kentaro Goto) over 2 years ago
Thanks znz. The workaround is helpful, and now I understand what happened.
https://github.com/k-takata/Onigmo/issues/92#issuecomment-373981492 shows how some combinations of letters are variable length.
For example, "ss" and "st" are mapped to "ß" ("\u00DF") and "ﬆ" ("\uFB06").
Those combinations are listed in ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt
By the way, this expansion by the //i option looks like overkill to me.
I wish case sensitivity and SpecialCasing mapping were separated...
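The length mismatch can be seen from Ruby itself. The sketch below uses String#downcase(:fold) as a stand-in; it is an assumption that this reflects the same folding data the regexp engine uses for //i, and the helper name fold_info is made up for illustration.

```ruby
# Shows how a single character can case-fold to two letters.
# fold_info is a hypothetical helper, not part of Ruby.
def fold_info(char)
  folded = char.downcase(:fold)
  {
    char: char,
    folded: folded,
    chars_before: char.length,
    chars_after: folded.length
  }
end

["\u00DF", "\uFB06"].each { |c| p fold_info(c) }
```

A fixed-length look-behind has to cope with both the one-character and the two-character form, which is where the "invalid pattern in look-behind" error appears to come from.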
Updated by shyouhei (Shyouhei Urabe) over 2 years ago
gotoken (Kentaro Goto) wrote:
By the way, this expansion by the //i option looks like overkill to me.
I wish case sensitivity and SpecialCasing mapping were separated...
I know how you feel. Too bad we are just doing what Unicode specifies to do.
Updated by gotoken (Kentaro Goto) over 2 years ago
Thanks, shyouhei, for pointing that out.
I imagine another Regexp option, say //I, which is almost the same as //i except that it never applies the SpecialCasing mapping.
This change does extend Unicode matching but does not introduce incompatibilities, IMHO.
A difficulty is that the implementation lives in the upstream library and CRuby is just a user.
Updated by duerst (Martin Dürst) over 2 years ago
gotoken (Kentaro Goto) wrote:
For example, "ss" and "st" are mapped to "ß" ("\u00DF") and "ﬆ" ("\uFB06").
Those combinations are listed in ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt
By the way, this expansion by the //i option looks like overkill to me.
I wish case sensitivity and SpecialCasing mapping were separated...
I still have to verify this, but currently I strongly suspect that the problem is NOT in SpecialCasing, but in how Onigmo (/Oniguruma?) implements it.
Updated by mauromorales (Mauro Morales) 10 months ago
FYI The issue has been addressed in Onigmo https://github.com/k-takata/Onigmo/pull/116 and has already been released in version 6.2.0. I tried it by applying the changes using Ruby 2.6.6 and it works as expected.
Updated by mauromorales (Mauro Morales) 9 days ago
Unfortunately, the problem persists in Ruby 2.7.2 and 3.0.0.