Bug #17341
closedUnsound quantifier reduction with nested quantifiers
Description
The rules for reducing nested quantifiers can produce quantifiers with semantics which differ from the original quantifiers. This can then lead to the regular expressions matching different strings.
irb(main):001:0> /(?:a+?)*/.match('aa')
(irb):1: warning: nested repeat operator '+?' and '*' was replaced with '+? and ?' in regular expression: /(?:a+?)*/
=> #<MatchData "a">
irb(main):002:0> /(a+?)*/.match('aa')
=> #<MatchData "aa" 1:"a">
In the above, we can see that by inserting a capture group between the two quantifiers, we prevent quantifier reduction from occurring and we get a regexp that matches the whole input. If we let quantifier reduction happen, we get a resulting regexp that only matches the first character. I think quantifier reduction should not change the behavior of a regexp, as it is just an optimization.
I found the quantifier reduction rules in ReduceTypeTable
in regparse.c
. I haven't checked them all but the ones that replace two quantifiers by two other quantifiers caught my eye.