Bug #3679

pathological regular expressions and exponential computation time

Added by Jamie Heilman over 1 year ago. Updated 10 months ago.

[ruby-core:31673]
Status:Rejected Start date:08/11/2010
Priority:Normal Due date:
Assignee:- % Done:

0%

Category:core
Target version:-
ruby -v:ruby 1.9.2dev (2010-07-11 revision 28618) [x86_64-linux]

Description

$ time ruby -e '"a" * 25 + "b" =~ /(?:a|a)+\z/'

real    0m6.686s
user    0m6.683s
sys     0m0.000s
$ time ruby -e '"a" * 26 + "b" =~ /(?:a|a)+\z/'

real    0m13.366s
user    0m13.319s
sys     0m0.037s
$ time ruby -e '"a" * 27 + "b" =~ /(?:a|a)+\z/'

real    0m26.712s
user    0m26.698s
sys     0m0.000s
$ time ruby -e '"a" * 17 + "b" =~ /(?:a|a|a)+\z/'

real    0m18.016s
user    0m18.005s
sys     0m0.000s
$ time ruby -e '"a" * 18 + "b" =~ /(?:a|a|a)+\z/'

real    0m54.049s
user    0m53.996s
sys     0m0.017s

this will pretty much hold true for any range of characters "a" provided the character(s) "b" are outside of that range, and you can tease this into using insane amounts of processing time by increasing the amount of overlap in the regular expression, iow:
$ time ruby -e '"ab" * 14 + "c" =~ /(?:[ab]|[ab])+\z/'

real    0m53.834s
user    0m53.780s
sys     0m0.020s


I ran into this due to some input validation code that was Regexp.union()'ing together various character classes carelessly and ending up with some overlap, the net effect of which was that certain invalid inputs were causing cpu exhaustion and application outages.  This would appear to affect both ruby 1.9 and 1.8 ... I didn't test further back.

History

Updated by Yui NARUSE over 1 year ago

  • Status changed from Open to Rejected
Write carefully. NFD based regexp is such thing.

Updated by Jamie Heilman over 1 year ago

That precludes optimization?

Updated by Yui NARUSE over 1 year ago

No, it's not a problem of optimization.
It's a problem from backtracking.
For further information, read this http://swtch.com/~rsc/regexp/regexp1.html

Updated by Jamie Heilman over 1 year ago

OK, I thought you were referring to normalized form canonical decomposition.  I haven't studied Ruby's regexp implementation much, but I know Unicode poses a lot of implementation and performance challenges.  Anyway, if isn't a problem of optimization, then why do Python, Perl, and PCRE all handle this case without grief?  I've read the referenced article already, and I have some appreciation for the complexity of the issue, but one of the only reasons I bothered to file a bug in the first place was because there must clearly exist effective mitigation for this particular degenerate case.

Updated by Josh ben Jore over 1 year ago

I apologise in advance for not being informative enough. Perl does in fact suffer from this as well since Perl's computational basis for regexps is in the same paradigm as Ruby's. We have however accumulated a number of optimizations and get-out-of-jail cards over the years. I opened http://perl5.git.perl.org/perl.git/blob/HEAD:/regexec.c just now to see what might show up labeled as an optimization or about being exponential. Sadly, I think I'd need to do a close reading of the source to find the escape clauses.

It might be worth reading the commit notes for this source file. Separately, you can view intimate details of the Perl re compilation by enabling debugging for either lexical scope or whatever. I suggest this to you in the hopes you might use it to find useful short-cuts that Oniguruma isn't yet.

In particular, there's the Debug and debug flags to the re.pm pragma and used in context for your examples:

   perl -e 'use re qw( Debug INTUIT OPTIMISE OPTIMISEM ); ("a" x 27 . "b") =~ /(?:a|a|a)+\z/'

Want to chat about this over lunch or a beer sometime? I'm in south lake union just north of you.

Updated by Yui NARUSE over 1 year ago

> OK, I thought you were referring to normalized form canonical decomposition.
Oh, sorry, I meant NFA.
http://en.wikipedia.org/wiki/Nondeterministic_finite-state_machine
http://oreilly.com/catalog/regex/chapter/ch04.html

Also available in: Atom PDF