Feature #5588

add negation flag (v) to Regexp

Added by Suraj Kurapati over 2 years ago. Updated over 1 year ago.

[ruby-core:40824]
Status:Feedback
Priority:Normal
Assignee:Yui NARUSE
Category:core
Target version:next minor

Description

Please add a negation flag (v) to regexps which inverts them:

"ruby" =~ /perl/v #=> true (turn on negation)
"ruby" !~ /perl/v #=> false (turn on negation)
"ruby" =~ /(?v:perl)/ #=> true (turn on negation)
"ruby" !~ /(?v:perl)/ #=> false (turn on negation)
"ruby" =~ /(?-v:perl)/ #=> false (turn off negation)
"ruby" !~ /(?-v:perl)/ #=> true (turn off negation)

The flag name "v" comes from the ex(1) command of the same name.

This has beneficial applications where it is sometimes difficult
to construct what you want to match but much easier to construct
what you do not want to match. Having this negation built in
the regexp itself would remove the need for us to change our
Ruby code to process a regexp in a different way.

For example, suppose that you are passing a regexp to the --name
command-line option of MiniTest. This regexp tells MiniTest to
only run those tests whose names match. If Ruby had a negation
flag on its regexps, then it would be suddenly trivial to make
MiniTest skip certain tests by negating the regexp we pass in.

In this manner, we get a beneficial feature without ever changing
our Ruby code (the codebase of MiniTest in this example). :-)

Thanks for your consideration.

5588_regexp_v.patch Magnifier - implementation done, but need to write RDoc (15.7 KB) Suraj Kurapati, 11/13/2011 06:59 PM

0001-http-redmine.ruby-lang.org-issues-5588.patch Magnifier - both implementation & RDoc done (18.4 KB) Suraj Kurapati, 11/15/2011 04:55 AM

5588_negative_lookahead.patch Magnifier (14.1 KB) Suraj Kurapati, 11/16/2011 02:42 PM

5588_negative_lookahead.patch Magnifier - updated test case (14.3 KB) Suraj Kurapati, 11/17/2011 07:50 AM

History

#1 Updated by Suraj Kurapati over 2 years ago

Note that I am not asking for the ability to take an arbitrary regexp and construct its negated form (that is a hard problem). Instead, I am asking for a flag that simply inverts how a regexp is treated by the =~ and !~ operators (this is not a hard problem). The negation flag should not change the Regexp#source of a regexp.

Thanks for your consideration.

#2 Updated by Konstantin Haase over 2 years ago

It is not to hard to negate a regexp, though. Simply wrap it as negative look-ahead.

Konstantin

On Nov 7, 2011, at 23:04 , Suraj Kurapati wrote:

Issue #5588 has been updated by Suraj Kurapati.

Note that I am not asking for the ability to take an arbitrary regexp and construct its negated form (that is a hard problem). Instead, I am asking for a flag that simply inverts how a regexp is treated by the

#3 Updated by Suraj Kurapati over 2 years ago

Good idea, but negative lookahead isn't the same as negation:

$ irb

ruby 1.9.3p0 (2011-10-30 revision 33570) [x86_64-linux]

"rubyperl" =~ /(?!perl)/
0
"rubyperl" =~ /perl/
4
"rubyperl" !~ /perl/
false

Thanks for your consideration.

#4 Updated by Yukihiro Matsumoto over 2 years ago

How do you treat positional information with v flag?

#5 Updated by Alexey Muranov over 2 years ago

I would suggest to deal with it a new (sub)class instead: arbitrary boolean combinations of regexps, without regard for positional information.

#6 Updated by Suraj Kurapati over 2 years ago

Hello Matz,

What do you mean by positional information:

  • anchors (\A, , $, \G, etc.) ?
  • capture-group numbers ($1, $2, $3, etc.) ?
  • MatchData#begin, #end, #offset ?

The v flag should only affect the low-level =~ and !~ operators in the C implementation. All subsequent processing should be performed as normal.

Thanks for your consideration.

#7 Updated by Suraj Kurapati over 2 years ago

I tried to implement this feature in this patch:
https://github.com/sunaku/ruby/commit/79305ba55c7ece5501c9219942eaf30e01a370a9

I was able to make Ruby recognize 'v' as an embedded regexp flag,
and was able to create regexps via the new Regexp::NEGATED constant.

However, I am stuck on the following things. Any tips? :-)

  • We can pass '(?v:)' in embedded regexp but it does not take effect.
    The resulting regexp object's #options field does not reflect 'v'.

  • Make the parser accept 'v' as option at the end of literal regexps.

Thanks for your consideration.

#8 Updated by Akira Tanaka over 2 years ago

2011/11/9 Suraj Kurapati sunaku@gmail.com:

  • We can pass '(?v:)' in embedded regexp but it does not take effect. The resulting regexp object's #options field does not reflect 'v'.

The option can be embedded to middle of a regexp: /foo(?v:bar)baz/

I think it doesn't work.
--
Tanaka Akira

#9 Updated by Suraj Kurapati over 2 years ago

You are correct, Tanaka. I have been exploring the regexp implementation to understand why and I learned that the negate flag must be compiled as an opcode (similar to multiline and ignorecase) into the repatternbuffer->p string.

My current approach is to compile (?v:...) into two opcodes (OPBEGINNEGATE and OPENDNEGATE) with normal compilation of the ... stuff inside. Later, when matchat() in regexec.c:1254 is processing the compiled regexp and encounters OPEND_NEGATE, it will perform the negation:

If we consumed > 0 input characters since OPBEGINNEGATE, then we have successfully matched the ... stuff inside the original (?v:...) regexp. So we need to stop further processing by returning ONIG_MISMATCH.

Otherwise, we did not consume any input characters since OPBEGINNEGATE, so we continue processing by returning ONIG_NORMAL. This effectively treats the (?v:...) as a zero-length regexp, which always matches of course.

That's my plan for now. Any comments?

Thanks for your consideration.

#10 Updated by Suraj Kurapati over 2 years ago

After several deep excursions into the regexp codebase, I've had enough. :)

As Tanaka(!) showed in 2007, negative global regexps are already there:

http://www.ruby-forum.com/topic/133413#595368

"rubyperl" =~ /(?!perl.)+$/
nil

So you can close this feature request now. Sorry for the noise.

#11 Updated by Suraj Kurapati over 2 years ago

Alas, I was unable to resist the lure of implementing this, so I'm back to give this another try.

My current approach is to expand (?v:STUFF) into (?:(?!STUFF).) when the regexp AST is built.

The goal is to have an negated embedded regexp say: "there is something here that is NOT this".

#12 Updated by Suraj Kurapati over 2 years ago

I did it! _^ Please take a look:

https://github.com/sunaku/ruby/compare/5588_regexp_v

There are a few issues remaining with the implementation:

  • Store snegate on STACK support nested embedded negated regexps.

  • Find a better way to double-pop the stack in OPNEGATEEND handler in regexec.c; currently it puts undefined values into s, p, etc. when the stack is popped for the second time in "goto fail" handler.

After that, I need to make Ruby parser accept /.../v as a literal regexp flag.

Thanks for your consideration.

#13 Updated by Suraj Kurapati over 2 years ago

Allow me to explain the current embedded negated regexp implementation.

When parsing an embedded negated regexp (?v:r), we expand them into this:

OPNEGATESTART(?:r)?OPNEGATEEND.*?

Here, OPNEGATESTART and OPNEGATEEND are opcodes in compiled pattern.

When the regexp engine (see matchat() in regexec.c) reaches OPNEGATE_START, we store the current state of input (up to which character of the input string have we consumed/matched so far?) in the "snegate" variable.

The regexp engine then continues onward and eventually reaches OPNEGATEEND. At this point, I compare the current state of input with "snegate". This tells us if the original embedded negated regexp (?v:r) has matched anything. Now we perform the negation:

If (?v:r) matched something, then treat this as a mismatch (prevent backtrack and goto fail). Otherwise, continue processing (the ".*?" after OPNEGATEEND will take care of consuming any non-matching characters so that we can still proceed).

assert_no_match(/a(?v:b)c/, "abc")
assert_match(/a(?v:b)c/, "axc")
assert_match(/a(?v:b)c/, "ac")
assert_match(/a(?v:b)c/, "axbc")

I hope this helps you understand my approach. Please correct me if I made a mistake.

Thanks for your consideration.

#14 Updated by Suraj Kurapati over 2 years ago

I have updated my patch to emit a single OP_NEGATE opcode after the negated embedded regexp (?v:...). This opcode double-pops the stack to prevent the optional alternation from succeeding. Please take a look:

https://github.com/sunaku/ruby/compare/5588_regexp_v

There might be a bug in the "(?v:r)" to "(?:rN)?" expansion (where "N" is OPNEGATE) because DONIGDEBUGPARSETREE shows the expanded "(?:rN)?" twice:

PATTERN: /a(?v:b)c/ (ASCII-8BIT)
list:965430
string:96af40a
enclose:965570 option:4096
quantifier:965520{0,1}
string:965480b

  <quantifier:965520>{0,1}      <=== BUG?  Why is this here twice?
     <string:965480>b

quantifier:96c830{0,-1}?
anychar:96ae00
string:96c920c

I will iron out these issues and finish the implementation in due time.

Cheers.

#15 Updated by Suraj Kurapati over 2 years ago

The double-printing of ENCLOSE_OPTION node was a bug in Oniguruma 5.9.2 and not in my code, for once! ;) I have submitted a fix for that bug to Kosako, the author of Oniguruma, accordingly. In case you are interested, here is the bug fix:

https://github.com/sunaku/ruby/commit/125c31a0fe42fb2937ea64c2f31283b81bb32d8b

Cheers.

#16 Updated by Suraj Kurapati over 2 years ago

I fixed the 'v' flag parsing in literal regexps: the problem was the value of ONIGOPTIONNEGATE that I chose (0x1000) collided with RBENCODINGOPTION mask (0xFF00).

Now the implementation is finally finished. Please review the attached patch and tell me what you think. If you like it, I will update the RDoc of the Regexp functions to reflect the new 'v' flag.

Thanks for your consideration.

#17 Updated by Suraj Kurapati over 2 years ago

I beautified my patches so they are easier to understand (especially the additions to test_regexp.rb in Ruby's test suite) and extracted the Oniguruma-only portions into a separate patch so that you can see what changes affect Ruby vs. Oniguruma. Here are the patches:

I would like your feedback on these. Thanks for your consideration.

#18 Updated by Suraj Kurapati over 2 years ago

I have explained this implementation in more detail on my blog:

http://snk.tuxfamily.org/log/oniguruma-negated-regexps.html

I hope that helps. Cheers. :)

#20 Updated by Yui NARUSE over 2 years ago

With your 's patch, I got following result.
Is this an expected result?

irb(main):029:0> /a(?v:b)c/=~"abc"
=> nil
irb(main):030:0> /a(?v:b)c/=~"abc"
=> nil
irb(main):031:0> /a(?v:b)c/=~"a
bc"
=> 0

#21 Updated by Suraj Kurapati over 2 years ago

Hi Naruse,

Thanks for trying my patch and for your questions! :)

Yui NARUSE wrote:

With your 's patch, I got following result.
Is this an expected result?

Yes, please allow me to explain why:

irb(main):029:0> /a(?v:b)c/=~"abc"
=> nil

First /a/ matched "a", then /(?v:b)/ mismatched "b". Failure.

irb(main):030:0> /a(?v:b)c/=~"ab_c"
=> nil

First /a/ matched "a", then /(?v:b)/ mismatched "b". Failure.

irb(main):031:0> /a(?v:b)c/=~"a_bc"
=> 0

First /a/ matched "a", then /(?v:b)/ matched "", then /.?/
(created when /(?v:b)/ was expanded into /(?:bN)?.
?/ where N is
OP
NEGATE) matched "b", and finally /c/ matched "c". Success.

See also my explanation of partly negated regexp expansion:
http://snk.tuxfamily.org/log/oniguruma-negated-regexps.html

Thanks for your consideration.

#22 Updated by Suraj Kurapati over 2 years ago

Suraj Kurapati wrote:

Yui NARUSE wrote:

irb(main):031:0> /a(?v:b)c/=~"a_bc"
=> 0

then /(?v:b)/ matched "_"

Hmm, that explanation isn't fully accurate. Let me try again:

First /a/ matched "a", then /(?:bN)?.*?/ (which is the parse-tree
expansion of /(?v:b)/) matched "_b", and finally /c/ matched "c".
Success.

Sorry for the confusion.

#23 Updated by Yui NARUSE over 2 years ago

  • Category set to core
  • Status changed from Open to Assigned
  • Assignee set to Yui NARUSE
  • Target version set to 2.0.0

Thank you for detailed explanation,
And sorry, explains it.
I believe this spec is wrong; (?v:foo) should match a sequence which doesn't include "foo".
OR a sequence just match "foo".

For example, /"(?v:<|&|")"/, AttValue of XML http://www.w3.org/TR/xml/#NT-AttValue
The behavior of this regexp seems /"[<&"]*"/ but

irb(main):007:0> /"[<&"]*"/=~'"aa<&a"'
=> nil
irb(main):008:0> /"(?v:<|&|")"/=~'"aa<&a"'
=> 0

This feels strange.

Do you have any use case which show current behavior is more reasonable?

#24 Updated by Suraj Kurapati over 2 years ago

I don't have a use case for the current behavior; it was just the
simplest way to remove non-matching input characters that obstructed
the matching engine. And I agree with your examples; people would
naturally think of /(?v:...)/ as a glorified form of /[...]/.

The solution is to change the parse-tree expansion into this:

/(?v:r)/ => /(?:(?:rN)?.)/

In this manner, the /(?:rN)/ acts as a barrier that only allows
input characters that do not match r to be matched by the /./.

However, this seems very similar to Tanaka's 2007 solution1:

/(?v:r)/ => /(?:(?!r).)/

I will play with Oniguruma in GDB some more to learn how (?!) works.
Perhaps my OP_NEGATE modification is actually unnecessary. Cheers.

#25 Updated by Suraj Kurapati over 2 years ago

Hello Naruse,

I have updated my patch to expand /(?v:r)/ into /(?:(?!rN).)/:
https://github.com/sunaku/ruby/compare/5588_regexp_v

It seems that OP_NEGATE is necessary after all, because without it,
Oniguruma will try to match a non-anchored partly negated regexp to
the rest of the input string.

For example, when processing /(?v:ruby)/ =~ "ruby", Oniguruma does:

  1. /(?v:ruby)/ =~ "ruby" # failure
  2. /(?v:ruby)/ =~ "uby" # success! return
  3. /(?v:ruby)/ =~ "by" # success! (illustation)
  4. /(?v:ruby)/ =~ "y" # success! (illustation)
  5. /(?v:ruby)/ =~ "" # failure (illustation)

Of course, Oniguruma stops at the first success (step 2). I added
the rest of the steps to illustrate how it continues trying to match
the rest of the input when a non-anchored regexp fails.

I had encountered this problem previously when coding OPNEGATE, and
solved it by returning a special value (ONIG
MISMATCHFROMNEGATE).
Now I simply re-used that existing logic for (?!) expanded in (?v:).

I have added your examples to the test_regexp.rb suite now.

Thanks for your consideration.

#26 Updated by Suraj Kurapati over 2 years ago

I am uncertain whether /(?v:ruby)/ =~ "ruby" should return nil or 1.

What do you think?

#27 Updated by Suraj Kurapati over 2 years ago

Alright, I have decided that partly negated regexps should behave like Tanaka's 2007 solution. They are easier to reason about (consistently) in that form:

/"(?v:ruby)+"/ =~ %q("ruby") # yields nil
/"(?v:ruby)+"/ =~ %q("rubyperl") # yields nil
/"(?v:ruby)+"/ =~ %q("perlruby") # yields nil
/"(?v:ruby)+"/ =~ %q(abc"perlru-by"xyz) # yields 3 and $& is %q("perlru-by")

Please review my simplified patch (attached) that no longer uses OP_NEGATE.

Thanks for your consideration.

#28 Updated by Suraj Kurapati over 2 years ago

Attaching patch with updated test case to illustrate
how unanchored partly negated regexp matching works:

/(?v:ruby)/ =~ "ruby"                #=> 1
["r", "u", "by"] == [$`, $&, $']     #=> true

Thanks for your consideration.

#29 Updated by Yui NARUSE over 2 years ago

I doubt this function under the current implementation should be flags because /ruby/v =~ "ruby" is now useless example.
I think people who want to use this negation flag should simply write (?:(?!r).).

Moreover it has a bug, for example over http://www.ruby-forum.com/topic/133413
If the suffix is not "dog" but "tv", the regexp may be /cat((?:(?!cat).)*)tv/.
But as following, it has false negative.

irb(main):013:0> /cat((?:(?!cat).)*)tv/=~"cat foo bar catv"
=> nil

The missing piece of your proposal is a use case.
All existing examples are too artificial.
Design should prior to implementations, and use cases should prior to designs.

I'm interesting in the idea negation flag.
But your proposal is limited by implementation.

Use cases I know is
* comments of C Language: /* ... /
* SGML CDATA: <![CDATA[ ... ]]>
* HTML 2.0 (RFC 1866) 3.2.5. Comments: /<!(--[-]
(-[-]+)--)>/
* HTML 4.0/XML Comments: /<!--[-](-[-]+)-->/
* HTTP header: until CRLFCRLF
* sequences (lines, sentences, paragraphs, and so on) which doesn't include a word

#30 Updated by Suraj Kurapati over 2 years ago

Yui NARUSE wrote:

I doubt this function under the current implementation should be
flags because /ruby/v =~ "ruby" is now useless example. I think
people who want to use this negation flag should simply write
(?:(?!r).).

I respectfully disagree; for me, wholly negated regexps are more
useful than partly negated ones. Please see my use cases below.

Moreover it has a bug, for example over
http://www.ruby-forum.com/topic/133413 If the suffix is not "dog"
but "tv", the regexp may be /cat((?:(?!cat).)*)tv/. But as
following, it has false negative.

irb(main):013:0> /cat((?:(?!cat).)*)tv/=~"cat foo bar catv"
=> nil

It works if you add a word-boundary anchor at the end of "cat":

/cat((?:(?!cat\b).)*)tv/=~"cat foo bar catv"
0
$&
"cat foo bar catv"

The missing piece of your proposal is a use case. All existing
examples are too artificial. Design should prior to
implementations, and use cases should prior to designs.

Very true, thanks for this much needed criticism.

I'm interesting in the idea negation flag. But your proposal is
limited by implementation.

I only have use cases for wholly negated regexps (/.../v):

  • some_enumerable.grep(/.../v)
  • somestring =~ someregexp # where some_regexp given by user
  • case some_string; when /.../v; end

That is why I became confused when implementing partly negated
regexps (/(?v:)/).

Use cases I know is
* comments of C Language: /* ... /
* SGML CDATA: <![CDATA[ ... ]]>
* HTML 2.0 (RFC 1866) 3.2.5. Comments: /<!(--[-]
(-[-]+)--)>/
* HTML 4.0/XML Comments: /<!--[-](-[-]+)-->/
* HTTP header: until CRLFCRLF
* sequences (lines, sentences, paragraphs, and so on) which doesn't include a word

These seem like good use cases of partly negated regexps (/(?v:)/).

#31 Updated by Yui NARUSE over 2 years ago

Suraj Kurapati wrote:

Yui NARUSE wrote:

I doubt this function under the current implementation should be
flags because /ruby/v =~ "ruby" is now useless example. I think
people who want to use this negation flag should simply write
(?:(?!r).).

I respectfully disagree; for me, wholly negated regexps are more
useful than partly negated ones. Please see my use cases below.

Ah, you think /v is still wholly negated regexp, i see.

Moreover it has a bug, for example over
http://www.ruby-forum.com/topic/133413 If the suffix is not "dog"
but "tv", the regexp may be /cat((?:(?!cat).)*)tv/. But as
following, it has false negative.

irb(main):013:0> /cat((?:(?!cat).)*)tv/=~"cat foo bar catv"
=> nil

It works if you add a word-boundary anchor at the end of "cat":

/cat((?:(?!cat\b).)*)tv/=~"cat foo bar catv"
0
$&
"cat foo bar catv"

This \b hack can only work when "t" and "v" is the same kind.
When replace "v" to "!", this won't work.

/cat((?:(?!cat\b).)*)t!/=~"cat foo bar cat!"
=> nil

The missing piece of your proposal is a use case. All existing
examples are too artificial. Design should prior to
implementations, and use cases should prior to designs.

Very true, thanks for this much needed criticism.

I'm interesting in the idea negation flag. But your proposal is
limited by implementation.

I only have use cases for wholly negated regexps (/.../v):

  • some_enumerable.grep(/.../v)
  • somestring =~ someregexp # where some_regexp given by user
  • case some_string; when /.../v; end

That is why I became confused when implementing partly negated
regexps (/(?v:)/).

They seems reasonable.
If you suggested only wholly one with such use case, this discussion would be more simple.

#32 Updated by Suraj Kurapati over 2 years ago

Interesting. Thanks for your feedback. I will submit a new patch that only contains wholly negated regexps (/.../v) this weekend. Cheers.

#33 Updated by Yui NARUSE almost 2 years ago

  • Status changed from Assigned to Feedback

#34 Updated by Nobuyoshi Nakada almost 2 years ago

=begin
What will (({Regexp#match})) with (({v})) flag return, and what will set to (({$~}))?
=end

#35 Updated by Yutaka HARA over 1 year ago

  • Target version changed from 2.0.0 to next minor

Also available in: Atom PDF