Bug #20316


Regexp quantifier behavior changes in different syntactic context.

Added by jirkamarsik (Jirka Marsik) 5 months ago.

Target version:
ruby -v:
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux-gnu]


In the example below, adding a vertical bar to the end of a regular expression changes what is being matched by the preceding regular expression.

irb(main):001:0> /(|a){3}b/.match("aab")
=> #<MatchData "aab" 1:"">
irb(main):002:0> /(|a){3}b|/.match("aab")
=> #<MatchData "aab" 1:"a">

This is because the {3} quantifier is compiled into a repeat loop which uses the OP_NULL_CHECK_END_MEMST operation to perform a capture-group sensitive null-check after every loop iteration. The logic behind the OP_NULL_CHECK_END_MEMST operation is implemented using the STACK_NULL_CHECK_MEMST macro in regexec.c. The STACK_NULL_CHECK_MEMST macro checks whether any capture groups have been updated inside the last loop iteration and it does so by searching the stack for STK_MEM_START entries. However, such entries are not used for all capture groups. They are only used by capture groups which are listed in bt_mem_start. A capture group is marked as needing such bookkeeping only when it is either referenced by a back-reference or it appears in certain syntactic contexts (see e.g. around line 4096 of regcomp.c). This looks like an optimization that tries to avoid polluting the stack with STK_MEM_START entries in cases in which they are not needed. However, in this case, not putting STK_MEM_START entries on the stack leads to a different match result.

In the example above, by adding a vertical bar to the end of the regexp, we have placed the group (|a) inside an alternation. This means that a different operation for tracking the state of the capture group is emitted in the compiled bytecode and this leads to a different result. Incidentally, this result should be the correct result, as the null-check ends up respecting the state of capture groups, as it tries to do in Ruby.

This is the compilation and execution of /(|a){3}b/.match("aab") with ONIG_DEBUG_PARSE_TREE, ONIG_DEBUG_COMPILE and ONIG_DEBUG_MATCH. Note that mem-start:1 is used for tracking the state of capture group 1.

PATTERN: /(|a){3}b/ (US-ASCII)
      <enclose:55db92999330> memory:1
optimize: EXACT
  anchor: []
  sub anchor: []

exact: [b]: length: 1
code length: 37
0:[repeat:0:27] 7:[null-check-start:0] 10:[mem-start:1] 13:[push:(+5)] 18:[jump:(+2)]
23:[exact1:a] 25:[mem-end:1] 28:[null-check-end-memst:0] 31:[repeat-inc:0] 34:[exact1:b]
match_at: str: 139809364021904 (0x7f27e77a9290), end: 139809364021906 (0x7f27e77a9292), start: 139809364021904 (0x7f27e77a9290), sprev: 0 ((nil))
size: 2, start offset: 0

 ofs> str                   stk:type   addr:opcode
   0> "ab"                    0:Alt       0:[repeat:0:27]
   0> "ab"                    1:Rep       7:[null-check-start:0]
   0> "ab"                    2:NulChS   10:[mem-start:1]
   0> "ab"                    2:NulChS   13:[push:(+5)]
   0> "ab"                    3:Alt      18:[jump:(+2)]
   0> "ab"                    3:Alt      25:[mem-end:1]
   0> "ab"                    3:Alt      28:[null-check-end-memst:0]
NULL_CHECK_END_MEMST: skip  id:0, s:139809364021904 (0x7f27e77a9290)
   0> "ab"                    3:Alt      34:[exact1:b]
   0> "ab"                    2:NulChS   23:[exact1:a]
   1> "b"                     2:NulChS   25:[mem-end:1]
   1> "b"                     2:NulChS   28:[null-check-end-memst:0]
   1> "b"                     2:NulChS   31:[repeat-inc:0]
   1> "b"                     3:RepInc    7:[null-check-start:0]
   1> "b"                     4:NulChS   10:[mem-start:1]
   1> "b"                     4:NulChS   13:[push:(+5)]
   1> "b"                     5:Alt      18:[jump:(+2)]
   1> "b"                     5:Alt      25:[mem-end:1]
   1> "b"                     5:Alt      28:[null-check-end-memst:0]
NULL_CHECK_END_MEMST: skip  id:0, s:139809364021905 (0x7f27e77a9291)
   1> "b"                     5:Alt      34:[exact1:b]
   2> ""                      5:Alt      36:[end]

This is the compilation and execution of /(|a){3}b|/.match("aab") (where the group (|a) appears inside an alternation) with ONIG_DEBUG_PARSE_TREE, ONIG_DEBUG_COMPILE and ONIG_DEBUG_MATCH. Note that mem-start-push:1 is used for tracking the state of capture group 1, not mem-start:1.

PATTERN: /(|a){3}b|/ (US-ASCII)
         <enclose:55d9a91cc340> memory:1
optimize: NONE
  anchor: []

code length: 47
0:[push:(+41)] 5:[repeat:0:27] 12:[null-check-start:0] 15:[mem-start-push:1] 18:[push:(+5)]
23:[jump:(+2)] 28:[exact1:a] 30:[mem-end:1] 33:[null-check-end-memst:0] 36:[repeat-inc:0]
39:[exact1:b] 41:[jump:(+0)] 46:[end]
match_at: str: 140072854131304 (0x7f6540b69268), end: 140072854131306 (0x7f6540b6926a), start: 140072854131304 (0x7f6540b69268), sprev: 0 ((nil))
size: 2, start offset: 0

 ofs> str                   stk:type   addr:opcode
   0> "ab"                    0:Alt       0:[push:(+41)]
   0> "ab"                    1:Alt       5:[repeat:0:27]
   0> "ab"                    2:Rep      12:[null-check-start:0]
   0> "ab"                    3:NulChS   15:[mem-start-push:1]
   0> "ab"                    4:MemS     18:[push:(+5)]
   0> "ab"                    5:Alt      23:[jump:(+2)]
   0> "ab"                    5:Alt      30:[mem-end:1]
   0> "ab"                    5:Alt      33:[null-check-end-memst:0]
   0> "ab"                    5:Alt      36:[repeat-inc:0]
   0> "ab"                    6:RepInc   12:[null-check-start:0]
   0> "ab"                    7:NulChS   15:[mem-start-push:1]
   0> "ab"                    8:MemS     18:[push:(+5)]
   0> "ab"                    9:Alt      23:[jump:(+2)]
   0> "ab"                    9:Alt      30:[mem-end:1]
   0> "ab"                    9:Alt      33:[null-check-end-memst:0]
NULL_CHECK_END_MEMST: skip  id:0, s:140072854131304 (0x7f6540b69268)
   0> "ab"                    9:Alt      39:[exact1:b]
   0> "ab"                    8:MemS     28:[exact1:a]
   1> "b"                     8:MemS     30:[mem-end:1]
   1> "b"                     8:MemS     33:[null-check-end-memst:0]
   1> "b"                     8:MemS     36:[repeat-inc:0]
   1> "b"                     9:RepInc   12:[null-check-start:0]
   1> "b"                    10:NulChS   15:[mem-start-push:1]
   1> "b"                    11:MemS     18:[push:(+5)]
   1> "b"                    12:Alt      23:[jump:(+2)]
   1> "b"                    12:Alt      30:[mem-end:1]
   1> "b"                    12:Alt      33:[null-check-end-memst:0]
   1> "b"                    12:Alt      36:[repeat-inc:0]
   1> "b"                    13:RepInc   39:[exact1:b]
   2> ""                     13:RepInc   41:[jump:(+0)]
   2> ""                     13:RepInc   46:[end]

No data to display


Also available in: Atom PDF