Project

General

Profile

Actions

Bug #20315

open

Quantifier expansion leads to different results in Regexp.

Added by jirkamarsik (Jirka Marsik) 9 months ago.

Status:
Open
Assignee:
-
Target version:
-
ruby -v:
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux-gnu]
[ruby-core:117013]

Description

Consider the following series of regular expression matches:

irb(main):001:0> /(|a)(|a)(|a)(|a)(|a)b/.match("aaaab")
=> #<MatchData "aaaab" 1:"" 2:"a" 3:"a" 4:"a" 5:"a">
irb(main):002:0> /(|a)(|a)(|a)(|a)b/.match("aaab")
=> #<MatchData "aaab" 1:"" 2:"a" 3:"a" 4:"a">
irb(main):003:0> /(|a)(|a)(|a)b/.match("aab")
=> #<MatchData "aab" 1:"" 2:"a" 3:"a">
irb(main):004:0> /(|a)(|a)b/.match("ab")
=> #<MatchData "ab" 1:"" 2:"a">

Let X^{N} mean N concatenated repetitions of X. When matching the pattern /(|a)^{N}b/ against a^{N-1}b, the first group will match the empty string and the last N-1 groups will match a.

Now, let's look at this series of similar expressions, in which (|a)^{N} is replaced with (|a){N}, i.e. a counted quantifier.

irb(main):001:0> /(|a){5}b/.match("aaaab")
=> #<MatchData "aaaab" 1:"">
irb(main):002:0> /(|a){4}b/.match("aaab")
=> #<MatchData "aaab" 1:"">
irb(main):003:0> /(|a){3}b/.match("aab")
=> #<MatchData "aab" 1:"">
irb(main):004:0> /(|a){2}b/.match("ab")
=> #<MatchData "ab" 1:"a">

When matching the pattern /(|a){N}b/ against a^{N-1}b, the first N-1 iterations will match a and the N-th iteration will match the empty string (compare this with the behavior of the first series of expressions). However, something strange happens when N is 2. We end up getting a result which is not consistent with this series, but looks like a result that belongs to the first series discussed above.

This is due to quantifier expansion done by the regexp compiler (see usages of QUANTIFIER_EXPAND_LIMIT_SIZE in regcomp.c). This is an optimization that tries to remove the overhead of managing a counted repetition during regex execution at the cost of increasing the size of the compiled regex bytecode. The source of the inconsistency is caused by the fact that this optimization can actually change the semantics of the regular expression, because as we have seen above, X^{N} (X repeated N times) can have different semantics from X{N} (a single X with an {N} quantifier).

Compilation and execution of /(|a){3}b/.match("aab") with ONIG_DEBUG_PARSE_TREE, ONIG_DEBUG_COMPILE and ONIG_DEBUG_MATCH:

PATTERN: /(|a){3}b/ (US-ASCII)
<list:556161bb11f0>
   <quantifier:556161bb1230>{3,3}
      <enclose:556161bb1330> memory:1
         <alt:556161bb11b0>
            <string:556161bb12f0>
            <string:556161bb12b0>a
   <string:556161bb13b0>b
optimize: EXACT
  anchor: []
  sub anchor: []

exact: [b]: length: 1
code length: 37
0:[repeat:0:27] 7:[null-check-start:0] 10:[mem-start:1] 13:[push:(+5)] 18:[jump:(+2)]
23:[exact1:a] 25:[mem-end:1] 28:[null-check-end-memst:0] 31:[repeat-inc:0] 34:[exact1:b]
36:[end]
match_at: str: 140607416406704 (0x7fe1b71b92b0), end: 140607416406706 (0x7fe1b71b92b2), start: 140607416406704 (0x7fe1b71b92b0), sprev: 0 ((nil))
size: 2, start offset: 0

 ofs> str                   stk:type   addr:opcode
   0> "ab"                    0:Alt       0:[repeat:0:27]
   0> "ab"                    1:Rep       7:[null-check-start:0]
   0> "ab"                    2:NulChS   10:[mem-start:1]
   0> "ab"                    2:NulChS   13:[push:(+5)]
   0> "ab"                    3:Alt      18:[jump:(+2)]
   0> "ab"                    3:Alt      25:[mem-end:1]
   0> "ab"                    3:Alt      28:[null-check-end-memst:0]
NULL_CHECK_END_MEMST: skip  id:0, s:140607416406704 (0x7fe1b71b92b0)
   0> "ab"                    3:Alt      34:[exact1:b]
   0> "ab"                    2:NulChS   23:[exact1:a]
   1> "b"                     2:NulChS   25:[mem-end:1]
   1> "b"                     2:NulChS   28:[null-check-end-memst:0]
   1> "b"                     2:NulChS   31:[repeat-inc:0]
   1> "b"                     3:RepInc    7:[null-check-start:0]
   1> "b"                     4:NulChS   10:[mem-start:1]
   1> "b"                     4:NulChS   13:[push:(+5)]
   1> "b"                     5:Alt      18:[jump:(+2)]
   1> "b"                     5:Alt      25:[mem-end:1]
   1> "b"                     5:Alt      28:[null-check-end-memst:0]
NULL_CHECK_END_MEMST: skip  id:0, s:140607416406705 (0x7fe1b71b92b1)
   1> "b"                     5:Alt      34:[exact1:b]
   2> ""                      5:Alt      36:[end]

Compilation and execution of /(|a){2}b/.match("ab") with ONIG_DEBUG_PARSE_TREE, ONIG_DEBUG_COMPILE and ONIG_DEBUG_MATCH:

PATTERN: /(|a){2}b/ (US-ASCII)
<list:55bbc826c1f0>
   <quantifier:55bbc826c230>{2,2}
      <enclose:55bbc826c330> memory:1
         <alt:55bbc826c1b0>
            <string:55bbc826c2f0>
            <string:55bbc826c2b0>a
   <string:55bbc826c3b0>b
optimize: EXACT
  anchor: []
  sub anchor: []

exact: [b]: length: 1
code length: 39
0:[mem-start:1] 3:[push:(+5)] 8:[jump:(+2)] 13:[exact1:a] 15:[mem-end:1]
18:[mem-start:1] 21:[push:(+5)] 26:[jump:(+2)] 31:[exact1:a] 33:[mem-end:1]
36:[exact1:b] 38:[end]
match_at: str: 140139875963504 (0x7f74db869270), end: 140139875963506 (0x7f74db869272), start: 140139875963504 (0x7f74db869270), sprev: 0 ((nil))
size: 2, start offset: 0

 ofs> str                   stk:type   addr:opcode
   0> "ab"                    0:Alt       0:[mem-start:1]
   0> "ab"                    0:Alt       3:[push:(+5)]
   0> "ab"                    1:Alt       8:[jump:(+2)]
   0> "ab"                    1:Alt      15:[mem-end:1]
   0> "ab"                    1:Alt      18:[mem-start:1]
   0> "ab"                    1:Alt      21:[push:(+5)]
   0> "ab"                    2:Alt      26:[jump:(+2)]
   0> "ab"                    2:Alt      33:[mem-end:1]
   0> "ab"                    2:Alt      36:[exact1:b]
   0> "ab"                    1:Alt      31:[exact1:a]
   1> "b"                     1:Alt      33:[mem-end:1]
   1> "b"                     1:Alt      36:[exact1:b]
   2> ""                      1:Alt      38:[end]

No data to display

Actions

Also available in: Atom PDF

Like0