Bug #9694
closedBad regexp hangs ruby
Description
Here is an extracted problem i ran into recently:
$ cat test.rb
str = ('a' * ARGV[0].to_i) + '?'
re = /(\w*)*$/
re.match(str)
On few chars match returns quite fast, but here's what happens on 14 'a'-s and up:
$ time RBENV_VERSION=2.1.1 ruby test.rb 14
real 1.392 user 1.364 sys 0.026 pcpu 99.83
$ time RBENV_VERSION=2.1.1 ruby test.rb 15
real 3.979 user 3.949 sys 0.026 pcpu 99.89
$ time RBENV_VERSION=2.1.1 ruby test.rb 16
real 11.995 user 11.954 sys 0.031 pcpu 99.92
Ruby versions 1.9.3 and 2.0 behave similarly.
I ran into the problem, because one of my colleagues copy-pasted this regexp to test url's somewhere from stackoverflow:
/^(https?://)?([\da-z.-]+).([a-z.]{2,6})([/\w .-:])/?$/
I know the regexp is useless, however i think it's still a problem if a bad regexp can hang ruby.
Python (2.7) says that this regexp is bad and does not compile it.
Perl matches without any performance issues
Updated by rafaelfranca (Rafael França) over 10 years ago
I tried to reproduce with your script and could not:
$ ruby -v
ruby 2.1.1p76 (2014-02-24 revision 45161) [x86_64-darwin13.0]
$ cat foo.rb
str = ('a' * ARGV[0].to_i) + '?'
re = /(\w)$/
re.match(str)
$ time ruby foo.rb 14
real 0m0.134s
user 0m0.075s
sys 0m0.050s
$ time ruby foo.rb 15
real 0m0.145s
user 0m0.082s
sys 0m0.054s
$ time ruby foo.rb 16
real 0m0.137s
user 0m0.076s
sys 0m0.052s
$ time ruby foo.rb 50
real 0m0.142s
user 0m0.080s
sys 0m0.053s
What is the environment you are running?
Updated by gergoerdosi (Gergo Erdosi) over 10 years ago
For some reason the regex is not displayed correctly. I'm able to reproduce the reported issue (see the correct regex below):
$ cat test.rb
str = ('a' * ARGV[0].to_i) + '?'
re = /(\w*)*$/
re.match(str)
$ time ruby test.rb 14
real 0m1.179s
user 0m1.128s
sys 0m0.004s
$ time ruby test.rb 15
real 0m3.568s
user 0m3.419s
sys 0m0.020s
$ time ruby test.rb 16
real 0m10.767s
user 0m10.276s
sys 0m0.067s
Updated by shyouhei (Shyouhei Urabe) over 10 years ago
Ruby's regexp engine is NP-complete. It's ultimately impossible to guarantee
regexp matches to run fast (if you don't think so please send us a proof). It
might be possible to warn your specific bad regexp, but in general it's also
impossible to tell which regexp is bad and which isn't. That's the Halintg
problem http://en.wikipedia.org/wiki/Halting_problem .
So, in short, there is (at least believed to be) no ultimate solution. All what
might be possible is to find a reasonable compromise. For instance python does
not detect that shorter one:
zsh % python
Python 2.7.5+ (default, Feb 27 2014, 19:37:08)
[GCC 4.8.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.compile(r'(\w*)*$')
<_sre.SRE_Pattern object at 0x1dee030>
>>>
Updated by mxposed (Nikolay Markov) over 10 years ago
Rafael, i'm sorry, the bad regexp is not displaying properly, something is obviously wrong with my formatting. Gergo reproduced it same as i have it.
Urabe, do you know how Perl does that? Also, i'll be grateful for the link to regexp sources in ruby
Updated by shyouhei (Shyouhei Urabe) over 10 years ago
Nikolay Markov wrote:
Urabe, do you know how Perl does that? Also, i'll be grateful for the link to regexp sources in ruby
Don't know anything about perl's. Ruby's regexp engine is called Onigmo: https://github.com/k-takata/Onigmo
Updated by jeremyevans0 (Jeremy Evans) over 5 years ago
- Status changed from Open to Closed