Bug #17990

Inconsistent behavior of Regexp quantifiers over characters with complex case foldings

Added by jirkamarsik (Jirka Marsik) almost 3 years ago.

Status:
Open
Assignee:
-
Target version:
-
ruby -v:
ruby 3.0.1p64 (2021-04-05 revision 0fb782ee38) [x86_64-linux]
[ruby-core:104276]

Description

With case-insensitive Regexps, the string "ff" is considered equal to the string "\ufb00", which consists of a single ligature character.

irb(main):001:0> /ff/i.match("\ufb00")
=> #<MatchData "ff">

This behavior also persists when the string "ff" doesn't appear literally in the Regexp source but is expressed using a fixed-length quantifier, as in the following:

irb(main):002:0> /f{2}/i.match("\ufb00")
=> #<MatchData "ff">
irb(main):003:0> /f{2,2}/i.match("\ufb00")
=> #<MatchData "ff">

However, this doesn't hold in general. When using other quantifiers, the ligature character "\ufb00" is not recognized as a sequence of two "f" characters.

irb(main):004:0> /f*/i.match("\ufb00")
=> #<MatchData "">
irb(main):005:0> /f+/i.match("\ufb00")
=> nil
irb(main):006:0> /f{1,}/i.match("\ufb00")
=> nil
irb(main):007:0> /f{1,2}/i.match("\ufb00")
=> nil
irb(main):008:0> /f{,2}/i.match("\ufb00")
=> #<MatchData "">
irb(main):009:0> /ff?/i.match("\ufb00")
=> nil

This leads to inconsistent behavior where a Regexp like /f{1,2}/i matches fewer strings than the stricter Regexp /f{2,2}/i.
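The transcripts above can be gathered into one script, which makes it easier to compare results across Ruby versions. This is only a minimal sketch: the `PATTERNS` list and the `quantifier_report` helper are names made up for illustration, and the output is deliberately not annotated because it depends on the interpreter being tested.

```ruby
# Patterns from the report, all compiled case-insensitively and applied to
# the single ligature character U+FB00 (LATIN SMALL LIGATURE FF).
LIGATURE = "\ufb00"

PATTERNS = %w[ff f{2} f{2,2} f* f+ f{1,} f{1,2} f{,2} ff?].freeze

# Hypothetical helper: maps each pattern source to the matched text,
# or to nil when the pattern does not match at all.
def quantifier_report(subject)
  PATTERNS.to_h do |source|
    match = Regexp.new(source, Regexp::IGNORECASE).match(subject)
    [source, match && match[0]]
  end
end

quantifier_report(LIGATURE).each do |source, matched|
  puts format("/%s/i  =>  %s", source, matched.inspect)
end
```

Calling `quantifier_report("ff")` instead gives a baseline in which every pattern matches the whole subject, so any nil or empty result for the ligature stands out.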

I suspect that this is caused by the pattern analyzer directly expanding /f{2}/i and /f{2,2}/i into /ff/i. However, this optimization changes the semantics of the Regexp, because a single ligature character otherwise cannot be matched by multiple repetitions of a quantified expression.
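To probe this hypothesis from the outside, the suspected expansion can be imitated on the pattern source and the two forms compared. The sketch below is an assumption-laden illustration, not the engine's own code: `expand_fixed_quantifier` and `same_outcome?` are hypothetical helpers, and the rewrite only handles a fixed-count quantifier applied to a single word character.

```ruby
# Hypothetical helper: rewrites a fixed-count quantifier on a single literal
# character (e.g. "f{3}" or "f{3,3}") into that character repeated.
# This mimics the suspected optimization textually.
def expand_fixed_quantifier(source)
  source.gsub(/(\w)\{(\d+)(?:,\2)?\}/) do
    Regexp.last_match(1) * Regexp.last_match(2).to_i
  end
end

# Returns true when the quantified pattern and its expanded form agree on
# whether the subject matches (both compiled case-insensitively).
def same_outcome?(source, subject)
  quantified = Regexp.new(source, Regexp::IGNORECASE)
  expanded   = Regexp.new(expand_fixed_quantifier(source), Regexp::IGNORECASE)
  quantified.match?(subject) == expanded.match?(subject)
end

puts expand_fixed_quantifier("f{2}")    # prints ff
puts expand_fixed_quantifier("f{2,2}")  # prints ff
puts expand_fixed_quantifier("f{1,2}")  # prints f{1,2} (not a fixed count)
puts same_outcome?("f{2}", "\ufb00")
```

If the fixed-count forms always agree with their expansions while the variable-count forms behave differently, that is consistent with the suspicion, although it does not prove where in the engine the expansion happens.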

While experimenting with this case, I have also discovered a related issue (caused by the problematic expansions of /f{n}/i and the issue reported here: https://bugs.ruby-lang.org/issues/17989).

These match:

/f{100}/i.match("f" * 100)
/f{100}/i.match("\ufb00" * 50)
/f{100}/i.match("\ufb00" * 49 + "ff")
/f{100}/i.match("ff" + "\ufb00" * 49)

However, this doesn't match:

/f{100}/i.match("f" + "\ufb00" * 49 + "f")
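One way to compare the five subjects is to look at where the ligatures fall once the text is case-folded. The following sketch uses `String#downcase(:fold)` for full Unicode case folding; the `SUBJECTS` table and the `ligature_offsets` helper are hypothetical names introduced here. Every subject folds to one hundred "f" characters, and only the last one places its ligatures at odd offsets. Whether that misalignment is the actual cause of the failed match is an assumption, not something the report confirms.

```ruby
LIG = "\ufb00"

SUBJECTS = {
  "100 x f"             => "f" * 100,
  "50 x ligature"       => LIG * 50,
  "49 x ligature, ff"   => LIG * 49 + "ff",
  "ff, 49 x ligature"   => "ff" + LIG * 49,
  "f, 49 x ligature, f" => "f" + LIG * 49 + "f",
}.freeze

# Hypothetical helper: offsets, within the case-folded text, at which each
# ligature of the subject begins.
def ligature_offsets(subject)
  offsets = []
  position = 0
  subject.each_char do |char|
    offsets << position if char == LIG
    position += char.downcase(:fold).length
  end
  offsets
end

SUBJECTS.each do |label, subject|
  folded_length = subject.downcase(:fold).length
  parities = ligature_offsets(subject).map { |o| o.even? ? "even" : "odd" }.uniq
  puts format("%-20s folded length %d, ligature offsets: %s",
              label, folded_length, parities.empty? ? "none" : parities.join("/"))
end
```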
