Bug #21859
Closed
Inconsistent behaviors in Regexp lookbehind/lookahead with capture groups
Description
First issue: Regexp.linear_time? is false for a positive lookahead with a capture, but true for a positive lookbehind:
irb(main):002> Regexp.linear_time?(/(?=(a))/)
=> false
irb(main):003> Regexp.linear_time?(/(?<=(a))/)
=> true
This should be false in both cases.
Second issue: Capture group is allowed in a negative lookahead, but causes a SyntaxError in a negative lookbehind:
irb(main):001> /(?!(a))b/
=> /(?!(a))b/
irb(main):002> /(?<!(a))b/
/home/alex/.local/share/mise/installs/ruby/4.0.1/lib/ruby/gems/4.0.0/gems/irb-1.16.0/exe/irb:9:in '<top (required)>': (irb):2: invalid pattern in look-behind: /(?<!(a))b/ (SyntaxError)
I believe such a capture group can never capture anything, so it probably should be an error in both cases.
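To reproduce the second issue from a script rather than irb, here is a minimal sketch. It assumes CRuby with Onigmo, where compiling a pattern at runtime with Regexp.new raises RegexpError rather than the SyntaxError that a regexp literal produces. The helper name compiles? is made up for illustration.

```ruby
# Hypothetical helper: true if the pattern source compiles, false otherwise.
def compiles?(source)
  Regexp.new(source)
  true
rescue RegexpError
  false
end

# Capture group inside a negative lookahead.
puts "(?!(a))b     compiles: #{compiles?('(?!(a))b')}"
# Capture group inside a negative lookbehind.
puts "(?<!(a))b    compiles: #{compiles?('(?<!(a))b')}"
# Same lookbehind with a non-capturing group instead.
puts "(?<!(?:a))b  compiles: #{compiles?('(?<!(?:a))b')}"
```

On the engine described in this report, only the middle pattern should fail to compile; other engines may behave differently.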
Updated by tompng (tomoya ishida) 20 days ago
First issue
This should be false in both cases.
I think Regexp.linear_time?(/(?<=(a))/) matches in linear time.
If the issue is just the inconsistency between lookahead and lookbehind, it's not a bug.
Here's an example:
Regexp.linear_time?(/x+(?=(a))/) #=> false
Regexp.linear_time?(/x+(?<=(a))/) #=> true
/x+(?=(a))/.match?('x' * 100000) #=> processing time: 28.599804s not linear_time
/x+(?<=(a))/.match?('x' * 100000) #=> processing time: 0.016630s linear_time
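A rough way to see the growth rate rather than a single timing is to repeat the measurement at a few input sizes. This is only a sketch: the sizes are arbitrary and kept small so it finishes quickly, the helper name failed_match_seconds is made up, and wall-clock timings are noisy.

```ruby
require 'benchmark'

# Patterns from the comment above; neither can match a string made only of 'x'.
LOOKAHEAD  = /x+(?=(a))/
LOOKBEHIND = /x+(?<=(a))/

# Hypothetical helper: seconds spent on one failed match against 'x' * n.
def failed_match_seconds(regexp, n)
  input = 'x' * n
  Benchmark.realtime { regexp.match?(input) }
end

[1_000, 2_000, 4_000].each do |n|
  ahead  = failed_match_seconds(LOOKAHEAD, n)
  behind = failed_match_seconds(LOOKBEHIND, n)
  puts format('n=%-5d lookahead=%.6fs lookbehind=%.6fs', n, ahead, behind)
end
```

If matching is quadratic, doubling n should roughly quadruple the time; if it is linear, the time should roughly double.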
Second issue: /(?!(a))b/ /(?<!(a))b/
I believe such a capture group can never capture anything
Capture group in negative lookahead can capture and can be used inside negative lookahead.
For negative lookbehind, I think it's just a restriction of Onigmo.
regexp = /(?!([a-z])\1)[a-z]{2}/
regexp.match?('ab') #=> true
regexp.match?('aa') #=> false
Updated by trinistr (Alexander Bulancov) 20 days ago
I think Regexp.linear_time?(/(?<=(a))/) matches in linear time.
I apologize, it seems I got distracted and forgot to actually check the execution times.
But this is interesting behavior. Maybe constant-size lookahead can be optimized to also be linear? It seems strange to me that these cases are so similar but behave very differently.
Capture group in negative lookahead can capture and can be used inside negative lookahead.
I've not been able to find a case where the capture group actually captures something, as opposed to the overall regexp merely matching. Isn't it impossible? To match, regexp needs to satisfy negative lookahead, so there should not be anything to capture.
regexp = /(?!([a-z])\1)[a-z]{2}/
regexp.match('ab') # => #<MatchData "ab" 1:nil>
regexp.match('aabaa') # => #<MatchData "ab" 1:nil>
regexp = /[a-z]{2}(?!([a-z])\1)/
regexp.match('aabaa') # => #<MatchData "aa" 1:nil>
Updated by tompng (tomoya ishida) 20 days ago
Isn't it impossible? To match, regexp needs to satisfy negative lookahead, so there should not be anything to capture.
As you wrote, captures are not available OUTSIDE of the negative lookahead, nor in MatchData.
But in /(?!([a-z])\1)[a-z]{2}/, \1 actually uses the capture: it's available INSIDE the negative lookahead. As a result, "aa", which matches ([a-z])\1, is rejected by the overall regexp. So a capture in a negative lookahead is a valid and meaningful pattern.
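The inside/outside distinction can be seen by putting the same subpattern in a positive lookahead for comparison. This is a small sketch for CRuby/Onigmo; the constant names are only for illustration.

```ruby
NEGATIVE = /(?!([a-z])\1)[a-z]{2}/ # reject two identical letters
POSITIVE = /(?=([a-z])\1)[a-z]{2}/ # require two identical letters

# Inside the negative lookahead, \1 sees the captured "a", so "aa" is rejected.
p NEGATIVE.match?('aa') #=> false
p NEGATIVE.match?('ab') #=> true

# Outside, the group is unset: the lookahead only succeeds when its body fails.
m = NEGATIVE.match('aabaa')
p m[0] #=> "ab"
p m[1] #=> nil

# A positive lookahead keeps what it captured.
m = POSITIVE.match('aabaa')
p m[0] #=> "aa"
p m[1] #=> "a"
```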
Updated by trinistr (Alexander Bulancov) 20 days ago
Thank you, I understand now what you meant.
Should this issue be changed to a feature request?
Updated by Eregon (Benoit Daloze) 20 days ago
- Status changed from Open to Closed
An interesting fact is that TruffleRuby/TRegex seems to report exactly the opposite for these 4 Regexps regarding whether they are linear:
truffleruby 33.0.1 (2026-01-20), like ruby 3.3.7, Oracle GraalVM Native [x86_64-linux]
irb(main):001> Regexp.linear_time?(/(?=(a))/)
=> true
irb(main):002> Regexp.linear_time?(/(?<=(a))/)
=> false
irb(main):003> Regexp.linear_time?(/x+(?=(a))/)
=> true
irb(main):004> Regexp.linear_time?(/x+(?<=(a))/)
=> false
ruby 4.0.1 (2026-01-13 revision e04267a14b) +PRISM [x86_64-linux]
irb(main):001> Regexp.linear_time?(/(?=(a))/)
=> false
irb(main):002> Regexp.linear_time?(/(?<=(a))/)
=> true
irb(main):003> Regexp.linear_time?(/x+(?=(a))/)
=> false
irb(main):004> Regexp.linear_time?(/x+(?<=(a))/)
=> true
I think that means it depends a lot on the specifics of the Regexp engine implementation.
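To compare engines without retyping the irb session, something like the following could be run under each implementation. It is only a sketch: linear_time_report is a made-up helper, and the guard is there because Regexp.linear_time? was added in Ruby 3.2 and may be missing elsewhere.

```ruby
PATTERNS = [/(?=(a))/, /(?<=(a))/, /x+(?=(a))/, /x+(?<=(a))/].freeze

# Hypothetical helper: maps each pattern source to the engine's answer,
# or to nil when Regexp.linear_time? is not available.
def linear_time_report(patterns)
  patterns.to_h do |regexp|
    answer = Regexp.respond_to?(:linear_time?) ? Regexp.linear_time?(regexp) : nil
    [regexp.source, answer]
  end
end

puts RUBY_DESCRIPTION
linear_time_report(PATTERNS).each do |source, answer|
  puts "#{source.ljust(14)} #{answer.inspect}"
end
```

No expected output is given here on purpose, since the answers are exactly what differs between implementations.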
I made a summary back then in https://bugs.ruby-lang.org/issues/19104#note-3
(FWIW I think /x+(?<=(a))/ can never match, it would need to match both 'x' and 'a' for the same input character)
Should this issue be changed to a feature request?
I think we should close this rather.
And if you want something specific and have a use for it then filing a new feature request is best.
FWIW I saw on 4.0.1 Regexp.linear_time?(/(?=a)/) is true but Regexp.linear_time?(/(?=(a))/) is false.
I don't think capture groups in lookahead or lookbehind are common though.
Updated by Eregon (Benoit Daloze) 19 days ago
Useful context for this issue, which would make sense to add to the description, is this Regexp item from the NEWS of 3.3:
https://github.com/ruby/ruby/blob/master/doc/NEWS/NEWS-3.3.0.md
The cache-based optimization now supports lookarounds and atomic groupings. That is, match for Regexp containing these extensions can now also be performed in linear time to the length of the input string. However, these cannot contain captures and cannot be nested. [Feature #19725]
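Given the "cannot contain captures" restriction, one option when the captured text is not actually needed is to drop the group from the lookahead. The sketch below checks that the overall match stays the same for a few sample inputs and then asks the engine about both forms. Whether the answer changes depends on the implementation and version, so no result is claimed here; the constant names are illustrative.

```ruby
CAPTURING     = /x+(?=(a))/
NON_CAPTURING = /x+(?=a)/

# The matched text should be identical; only the extra group differs.
%w[xxa xxb a].each do |input|
  with_capture    = input[CAPTURING]
  without_capture = input[NON_CAPTURING]
  puts "#{input.inspect}: #{with_capture.inspect} / #{without_capture.inspect}"
end

# Regexp.linear_time? was added in Ruby 3.2, hence the guard.
if Regexp.respond_to?(:linear_time?)
  puts "capturing:     #{Regexp.linear_time?(CAPTURING)}"
  puts "non-capturing: #{Regexp.linear_time?(NON_CAPTURING)}"
end
```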
Updated by trinistr (Alexander Bulancov) 18 days ago
Useful context for this issue, which would make sense to add to the description, is this Regexp item from the NEWS of 3.3
Yes, thank you, that's what led me to make the wrong assumption about the linearity of lookbehind.
I don't think capture groups in lookahead or lookbehind are common though.
Surprisingly, there is a smattering of capturing lookaheads in the Ruby distribution (for example, ext/extmk and optparse), though they are definitely not common.
I think we should close this rather.
Good enough for me.