Bug #21859

closed

Inconsistent behaviors in Regexp lookbehind/lookahead with capture groups

Added by trinistr (Alexander Bulancov) 21 days ago. Updated 18 days ago.

Status: Closed
Assignee: -
Target version: -
ruby -v: ruby 4.0.1 (2026-01-13 revision e04267a14b) +PRISM [x86_64-linux]
[ruby-core:124664]

Description

First issue: Regexp.linear_time? is false for a positive lookahead with a capture, but true for a positive lookbehind:

irb(main):002> Regexp.linear_time?(/(?=(a))/)
=> false
irb(main):003> Regexp.linear_time?(/(?<=(a))/)
=> true

This should be false in both cases.

Second issue: A capture group is allowed in a negative lookahead, but causes a SyntaxError in a negative lookbehind:

irb(main):001> /(?!(a))b/
=> /(?!(a))b/
irb(main):002> /(?<!(a))b/
/home/alex/.local/share/mise/installs/ruby/4.0.1/lib/ruby/gems/4.0.0/gems/irb-1.16.0/exe/irb:9:in '<top (required)>': (irb):2: invalid pattern in look-behind: /(?<!(a))b/ (SyntaxError)

I believe such a capture group can never capture anything, so it probably should be an error in both cases.
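The literal form fails at parse time, so the difference may be easier to observe by compiling both patterns from strings at run time. This is a minimal sketch, assuming an Onigmo-based CRuby such as the one in this report; `compiles?` is a hypothetical helper introduced only for illustration:

```ruby
# Hypothetical helper: returns true if the pattern source compiles,
# false if the regexp engine rejects it with a RegexpError.
def compiles?(source)
  Regexp.new(source)
  true
rescue RegexpError
  false
end

puts compiles?('(?!(a))b')   # capture inside a negative lookahead
puts compiles?('(?<!(a))b')  # capture inside a negative lookbehind
```

Going by the irb session above, the first pattern should be accepted and the second rejected; other regexp engines may behave differently.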

Updated by tompng (tomoya ishida) 20 days ago #1 [ruby-core:124665]

First issue

This should be false in both cases.

I think Regexp.linear_time?(/(?<=(a))/) matches in linear time.
If the issue is only the inconsistency between lookahead and lookbehind, it's not a bug.
Here's an example:

Regexp.linear_time?(/x+(?=(a))/) #=> false
Regexp.linear_time?(/x+(?<=(a))/) #=> true

/x+(?=(a))/.match?('x' * 100000) #=> processing time: 28.599804s (not linear time)
/x+(?<=(a))/.match?('x' * 100000) #=> processing time: 0.016630s (linear time)
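
The timings above can be reproduced with a small script along these lines. It is a rough sketch rather than a rigorous benchmark: the input length, the constant names and the `seconds_to_match` helper are assumptions made for illustration, and the input is much shorter than in the measurement above so that the slow case finishes quickly.

```ruby
# Input length is an assumption chosen to keep the run short;
# the measurement in the thread used 'x' * 100000.
TIMED_INPUT = 'x' * 2_000

TIMED_PATTERNS = {
  'lookahead'  => /x+(?=(a))/,
  'lookbehind' => /x+(?<=(a))/
}.freeze

# Returns the wall-clock seconds spent on a single match? call.
def seconds_to_match(regexp, input)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  regexp.match?(input)
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

TIMED_PATTERNS.each do |name, regexp|
  # Regexp.linear_time? exists from Ruby 3.2; report 'n/a' on older versions.
  linear = Regexp.respond_to?(:linear_time?) ? Regexp.linear_time?(regexp) : 'n/a'
  seconds = seconds_to_match(regexp, TIMED_INPUT)
  puts format('%-10s linear_time?=%-5s %.6fs', name, linear, seconds)
end
```

Doubling the input length and comparing the timings gives a rough idea of whether a pattern scales linearly or worse on a given engine.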

Second issue: /(?!(a))b/ /(?<!(a))b/

I believe such a capture group can never capture anything

A capture group in a negative lookahead can capture, and the capture can be used inside that negative lookahead.
For negative lookbehind, I think it's just a restriction of Onigmo.

regexp = /(?!([a-z])\1)[a-z]{2}/
regexp.match?('ab') #=> true
regexp.match?('aa') #=> false

Updated by trinistr (Alexander Bulancov) 20 days ago #2 [ruby-core:124666]

I think Regexp.linear_time?(/(?<=(a))/) matches in linear time.

I apologize, it seems I got distracted and forgot to actually check the execution times.
But this is interesting behavior. Maybe constant-size lookahead can be optimized to also be linear? It seems strange to me that these cases are so similar but behave very differently.

A capture group in a negative lookahead can capture, and the capture can be used inside that negative lookahead.

I've not been able to find a case where the capture group actually captures something, as opposed to the overall regexp merely matching. Isn't it impossible? To match, the regexp needs to satisfy the negative lookahead, so there should not be anything to capture.

regexp = /(?!([a-z])\1)[a-z]{2}/
regexp.match('ab') # => #<MatchData "ab" 1:nil>
regexp.match('aabaa') # => #<MatchData "ab" 1:nil>

regexp = /[a-z]{2}(?!([a-z])\1)/
regexp.match('aabaa') # => #<MatchData "aa" 1:nil>

Updated by tompng (tomoya ishida) 20 days ago #3 [ruby-core:124668]

Isn't it impossible? To match, the regexp needs to satisfy the negative lookahead, so there should not be anything to capture.

As you wrote, the captures are not available OUTSIDE of the negative lookahead, nor in MatchData.
But in /(?!([a-z])\1)[a-z]{2}/, \1 actually uses the capture: it is available INSIDE the negative lookahead. As a result, "aa", which matches ([a-z])\1, does not match the overall regexp. So a capture in a negative lookahead is a valid and meaningful pattern.
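
To make the INSIDE/OUTSIDE distinction concrete, the sketch below runs the same regexp over a few two-letter inputs and prints both the overall match and group 1. The `describe_match` helper and the sample inputs are assumptions added for illustration:

```ruby
DOUBLED_LETTER_GUARD = /(?!([a-z])\1)[a-z]{2}/

# Hypothetical helper: summarizes the overall match and capture group 1.
def describe_match(regexp, input)
  match = regexp.match(input)
  return "#{input}: no match" unless match

  "#{input}: matched #{match[0].inspect}, group 1 = #{match[1].inspect}"
end

%w[ab aa bb ba].each do |input|
  puts describe_match(DOUBLED_LETTER_GUARD, input)
end
```

The doubled inputs "aa" and "bb" are rejected because \1 sees the captured letter inside the lookahead, while for "ab" and "ba" the overall match succeeds and group 1 is reported as nil.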

Updated by trinistr (Alexander Bulancov) 20 days ago #4 [ruby-core:124670]

Thank you, I understand now what you meant.

Should this issue be changed to a feature request?

Updated by Eregon (Benoit Daloze) 20 days ago #5 [ruby-core:124671]

  • Status changed from Open to Closed

An interesting fact is that TruffleRuby/TRegex seems to report exactly the opposite of CRuby on whether these 4 Regexps are linear:

truffleruby 33.0.1 (2026-01-20), like ruby 3.3.7, Oracle GraalVM Native [x86_64-linux]
irb(main):001> Regexp.linear_time?(/(?=(a))/)
=> true
irb(main):002> Regexp.linear_time?(/(?<=(a))/)
=> false
irb(main):003> Regexp.linear_time?(/x+(?=(a))/)
=> true
irb(main):004> Regexp.linear_time?(/x+(?<=(a))/)
=> false
ruby 4.0.1 (2026-01-13 revision e04267a14b) +PRISM [x86_64-linux]
irb(main):001> Regexp.linear_time?(/(?=(a))/)
=> false
irb(main):002> Regexp.linear_time?(/(?<=(a))/)
=> true
irb(main):003> Regexp.linear_time?(/x+(?=(a))/)
=> false
irb(main):004> Regexp.linear_time?(/x+(?<=(a))/)
=> true

I think that means it depends a lot on the specifics of the Regexp engine implementation.
I made a summary back then in https://bugs.ruby-lang.org/issues/19104#note-3

(FWIW I think /x+(?<=(a))/ can never match, it would need to match both 'x' and 'a' for the same input character)
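
A quick way to check this claim is to try the pattern against a handful of inputs. This is only a spot check under assumed sample strings, not a proof; the reasoning is that wherever x+ stops, the character just before that position is an 'x', so the lookbehind for 'a' cannot succeed there.

```ruby
NEVER_MATCHING = /x+(?<=(a))/

# Sample inputs are assumptions chosen to mix 'x' and 'a' in different orders.
SAMPLE_INPUTS = %w[x a xa ax xxa axx xax].freeze

SAMPLE_INPUTS.each do |input|
  puts "#{input.inspect}: #{NEVER_MATCHING.match(input).inspect}"
end
```

Each line is expected to end in nil, which also means the fast timing reported for this pattern earlier in the thread measures a match that fails.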

Should this issue be changed to a feature request?

I think we should close this rather.
And if you want something specific and have a use for it then filing a new feature request is best.

FWIW I saw on 4.0.1 Regexp.linear_time?(/(?=a)/) is true but Regexp.linear_time?(/(?=(a))/) is false.
I don't think capture groups in lookahead or lookbehind are common though.
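
For comparing engines, a small script like the following can be run unchanged on each Ruby implementation. It is a sketch assuming Ruby 3.2 or later (where Regexp.linear_time? exists); the constant and the `linear_time_report` helper are names made up for illustration, and the printed values are whatever the running engine reports.

```ruby
REPORT_PATTERNS = [
  /(?=a)/,
  /(?=(a))/,
  /(?<=(a))/,
  /x+(?=(a))/,
  /x+(?<=(a))/
].freeze

# Hypothetical helper: maps each pattern's inspect string to the
# engine's own linear_time? answer.
def linear_time_report(patterns)
  patterns.to_h { |regexp| [regexp.inspect, Regexp.linear_time?(regexp)] }
end

puts RUBY_DESCRIPTION
linear_time_report(REPORT_PATTERNS).each do |source, linear|
  puts "#{source.ljust(16)} #{linear}"
end
```

Printing RUBY_DESCRIPTION first keeps the output self-describing when results from several implementations are pasted side by side.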

Updated by Eregon (Benoit Daloze) 19 days ago #6 [ruby-core:124672]

Useful context for this issue, which would make sense to add to the description, is this Regexp item from the NEWS for 3.3:
https://github.com/ruby/ruby/blob/master/doc/NEWS/NEWS-3.3.0.md

The cache-based optimization now supports lookarounds and atomic groupings. That is, match for Regexp containing these extensions can now also be performed in linear time to the length of the input string. However, these cannot contain captures and cannot be nested. [Feature #19725]

Updated by trinistr (Alexander Bulancov) 18 days ago #7 [ruby-core:124686]

Useful context for this issue, which would make sense to add to the description, is this Regexp item from the NEWS for 3.3

Yes, thank you, that's what led me to make the wrong assumption about the linearity of lookbehind.

I don't think capture groups in lookahead or lookbehind are common though.

Surprisingly, there is a smattering of capturing lookaheads in the Ruby distribution (for example, in ext/extmk and optparse), though they are definitely not common.

I think we should close this rather.

Good enough for me.
