Bug #9694

Bad regexp hangs ruby

Added by Nikolay Markov over 1 year ago. Updated over 1 year ago.

[ruby-core:61810]
Status:Open
Priority:Normal
Assignee:-
ruby -v:ruby 2.1.1p76 (2014-02-24 revision 45161) [x86_64-darwin13.0] Backport:2.0.0: UNKNOWN, 2.1: UNKNOWN

Description

Here is an extracted problem i ran into recently:
~~~
$ cat test.rb

str = ('a' * ARGV[0].to_i) + '?'
re = /(\w*)*$/
re.match(str)
~~~

On few chars match returns quite fast, but here's what happens on 14 'a'-s and up:
~~~
$ time RBENV_VERSION=2.1.1 ruby test.rb 14

real 1.392 user 1.364 sys 0.026 pcpu 99.83

$ time RBENV_VERSION=2.1.1 ruby test.rb 15

real 3.979 user 3.949 sys 0.026 pcpu 99.89

$ time RBENV_VERSION=2.1.1 ruby test.rb 16

real 11.995 user 11.954 sys 0.031 pcpu 99.92
~~~

Ruby versions 1.9.3 and 2.0 behave similarly.

I ran into the problem, because one of my colleagues copy-pasted this regexp to test url's somewhere from stackoverflow:
/https?:\\/)?([\da-z.-]+).([a-z.]{2,6})([\/\w .-:])\/?$/

I know the regexp is useless, however i think it's still a problem if a bad regexp can hang ruby.
Python (2.7) says that this regexp is bad and does not compile it.
Perl matches without any performance issues

History

#1 Updated by Rafael França over 1 year ago

I tried to reproduce with your script and could not:

$ ruby -v
ruby 2.1.1p76 (2014-02-24 revision 45161) [x86_64-darwin13.0]

$ cat foo.rb
str = ('a' * ARGV[0].to_i) + '?'
re = /(\w)$/
re.match(str)

$ time ruby foo.rb 14

real    0m0.134s
user    0m0.075s
sys 0m0.050s

$ time ruby foo.rb 15

real    0m0.145s
user    0m0.082s
sys 0m0.054s

$ time ruby foo.rb 16

real    0m0.137s
user    0m0.076s
sys 0m0.052s

$ time ruby foo.rb 50

real    0m0.142s
user    0m0.080s
sys 0m0.053s

What is the environment you are running?

#2 Updated by Gergo Erdosi over 1 year ago

For some reason the regex is not displayed correctly. I'm able to reproduce the reported issue (see the correct regex below):

$ cat test.rb 
str = ('a' * ARGV[0].to_i) + '?'
re = /(\w*)*$/
re.match(str)
$ time ruby test.rb 14

real    0m1.179s
user    0m1.128s
sys 0m0.004s
$ time ruby test.rb 15

real    0m3.568s
user    0m3.419s
sys 0m0.020s
$ time ruby test.rb 16

real    0m10.767s
user    0m10.276s
sys 0m0.067s

#3 Updated by Shyouhei Urabe over 1 year ago

Ruby's regexp engine is NP-complete. It's ultimately impossible to guarantee
regexp matches to run fast (if you don't think so please send us a proof). It
might be possible to warn your specific bad regexp, but in general it's also
impossible to tell which regexp is bad and which isn't. That's the Halintg
problem http://en.wikipedia.org/wiki/Halting_problem .

So, in short, there is (at least believed to be) no ultimate solution. All what
might be possible is to find a reasonable compromise. For instance python does
not detect that shorter one:

zsh % python
Python 2.7.5+ (default, Feb 27 2014, 19:37:08) 
[GCC 4.8.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.compile(r'(\w*)*$')
<_sre.SRE_Pattern object at 0x1dee030>
>>> 

#4 Updated by Nikolay Markov over 1 year ago

Rafael, i'm sorry, the bad regexp is not displaying properly, something is obviously wrong with my formatting. Gergo reproduced it same as i have it.

Urabe, do you know how Perl does that? Also, i'll be grateful for the link to regexp sources in ruby

#5 Updated by Shyouhei Urabe over 1 year ago

Nikolay Markov wrote:

Urabe, do you know how Perl does that? Also, i'll be grateful for the link to regexp sources in ruby

Don't know anything about perl's. Ruby's regexp engine is called Onigmo: https://github.com/k-takata/Onigmo

Also available in: Atom PDF