Feature #19725


Improve the match cache optimization to support look-ahead and atomic groups

Added by make_now_just (Hiroya Fujinami) about 1 year ago. Updated 6 months ago.

Target version:


Currently, the Regexp match cache optimization [Feature #19104] memoizes match cache points that once arrived; that is, it memoizes them before backtracking. This kind of implementation is simple and works fine in most cases. However, in some cases, it does not work well. For example,

  1. it is hard to preserve the strange null-loop behavior, and
  2. it is hard to support look-around operators and atomic groups correctly.

The first problem is simple. Memoization may change the matching behavior if a cache point in a null loop is memoized because Onigmo allows a null loop (a loop without consuming a character) at the first time to keep captures. We avoid this problem by carefully resetting the memoization table on null loops currently. But, nested null loops are not supported.

The second problem is a bit complex. For example, /\Aa*(?=a*)a\z/ does not match against "a" on the memozation before backtracking. The reason is that when the first matching of the look-ahead is succeeded, the loop branch in the look-ahead is memoized, and it is impossible to succeed the second matching of the look-ahead. Furthermore, how do we prevent increasing the matching time of /\Aa*?(?=a*)\z/ against "a" * N + "z" quadratically?

I suggest two improvements to resolve these issues:

  1. Memoizing after backtracking, and
  2. Memoizing partial match results for match cache points in look-ahead and atomic groups.

The first improvement is, I think, the canonical way to implement the match cache optimization. This avoids the first "null loop" problem entirely. This also enables to support look-ahead and atomic groups easily. This can be implemented by introducing an extra item to the stack and doing memoization on popping it.

The second improvement is to simulate on the memoization the atomic behavior of look-ahead and atomic (Of course!) groups on the memoization. The current memoization table only records the arrived points; that is, it does not record the result of matching and only records failures of matching. Actually, it is not problematic in the current implementation because "matching is succeeded" means the matching is done. However, when a part of look-around and atomic groups is matched, the stack is hoisted to behave atomically. To simulate this behavior, successes in matching these parts are also memoized. This can be implemented by introducing an extra bit for the result of matching to the memoization table.

These improvements will enable

  • to avoid the nested null loop restriction, and
  • to support positive/negative look-ahead and atomic groups correctly.

However, look-ahead and atomic groups that include captures cannot be supported. This is because captures may be changed on successful matches.

Actions #2

Updated by mame (Yusuke Endoh) 6 months ago

  • Status changed from Open to Closed

Also available in: Atom PDF