Bug #1740

Ruby regexp causes 100% CPU usage.

Added by paranormal dev over 2 years ago. Updated 9 months ago.

[ruby-core:24188]
Status: Rejected
Start date: 07/07/2009
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: -
Target version: -
ruby -v: ruby 1.8.7 (2009-04-08 patchlevel 160) [i386-freebsd6]

Description

I tested on FreeBSD with
ruby 1.8.7 (2009-04-08 patchlevel 160) [i386-freebsd6]

and on my Linux notebook with
ruby 1.8.7 (2009-06-08 patchlevel 173) [x86_64-linux]

using this code:
#######################################
require 'open-uri'
$KCODE = 'u' 

reg = %r{<.*?div\s*class\s*=\s*.entry.*?>[^<]*<.*?img\s*src\s*=\s*.([^"|']*).*?>[^<]*<.*?p\s*class\s*=\s*.date.*?>}im
#del = %r{<(?!p|div|img)[^>]*>}i

doc = open('http://www.radiokvit.com.ua/?p=1895').read

#doc.gsub!(del, ' ')
a = doc.match(reg)
p a
######################################

My Ruby process uses 100% CPU for a long time. On Linux it eventually exits normally; on FreeBSD it never exits %-(.
I submitted a separate bug for FreeBSD, http://www.freebsd.org/cgi/query-pr.cgi?pr=136384, but that one is FreeBSD-specific.

This pattern was written by someone else for Perl, and I have to use it here.

testfile.html - test document for the regexp, in case the URL is no longer valid. (23.3 kB) paranormal dev, 07/07/2009 11:38 pm

History

Updated by Eero Saynatkari over 2 years ago

Excerpts from rubymine message of Tue Jul 07 17:38:10 +0300 2009:
> reg = %r{<.*?div\s*class\s*=\s*.entry.*?>[^<]*<.*?img\s*src\s*=\s*.([^"|']*).*?>[^<]*<.*?p\s*class\s*=\s*.date.*?>}im
> #del = %r{<(?!p|div|img)[^>]*>}i
>
> My Ruby process uses 100% CPU for a long time. On Linux it eventually
> exits normally; on FreeBSD it never exits %-(.
> I submitted a separate bug for FreeBSD,
> http://www.freebsd.org/cgi/query-pr.cgi?pr=136384, but that one is FreeBSD-specific.
>
> This pattern was written by someone else for Perl, and I have to use it here.

Firstly, Ruby regexps are not PCRE, so you need to allow
for some differences when constructing the regexp. You
cannot (necessarily) just drop the Perl version in and
expect it to work, or to work the same.

Secondly, you should be using something like Nokogiri or
Hpricot rather than "parsing" the HTML yourself. For example
your div matcher will fail if the attribute is not quoted.

Thirdly, it has "pathological" written all over it. You
should refactor the regexp to try to get some small case
that is reproducible to illustrate the actual problem to
see if it is something that should be fixed.
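
A small case of the kind suggested above might look like the sketch below. The pattern (a+)+b and the helper name time_failed_match are illustrative assumptions, not taken from the reported regexp; they only show the usual shape of a pathological case, nested quantifiers followed by something that never matches. Timings depend heavily on the Ruby version and regexp engine, and newer engines may reject the input almost immediately, so no specific numbers are claimed.

```ruby
# Hypothetical minimal case: nested quantifiers followed by a literal
# ("b") that never occurs in the input, so every match attempt fails.
PATHOLOGICAL = /(a+)+b/

# Returns [match_result, elapsed_seconds] for an input of n "a" characters.
def time_failed_match(n)
  input = 'a' * n
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = PATHOLOGICAL.match(input)
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  [result, elapsed]
end

# Keep n small: on a backtracking engine the work can grow exponentially.
[10, 14, 18].each do |n|
  result, seconds = time_failed_match(n)
  printf("n=%d matched=%s %.6fs\n", n, !result.nil?, seconds)
end
```

If the elapsed time grows sharply as n increases, the pattern itself is the problem rather than the interpreter.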

I am pretty sure there was another thread about really bad
regexp performance in a pathological case a while back, if
you want to search the archives.


Eero
--
Magic is insufficiently advanced technology.

Updated by Nobuyoshi Nakada over 2 years ago

  • ruby -v changed from ruby 1.8.7 (2009-06-08 patchlevel 173) [x86_64-linux] to ruby 1.8.7 (2009-04-08 patchlevel 160) [i386-freebsd6]

Updated by Nobuyoshi Nakada over 2 years ago

  • Status changed from Open to Rejected
Too much backtracking consumes a lot of time.
You can use (?>...) to suppress backtracking:
  reg = %r{(?><div\s*class\s*=\s*.entry.*?>.*?<img\b[^<>]*\s+src\s*=\s*.([^\"|\']*).*?>).*?<p\s*class\s*=\s*.date.*?>}im
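
To illustrate what (?>...) does, here is a minimal sketch on a toy string rather than on the pattern above; the constant names are made up for the example. Once an atomic group has matched, the engine does not go back into it to try shorter alternatives. That removes a lot of retrying, but it can also change what the pattern matches.

```ruby
# Without an atomic group, a+ can give back one "a" so that "ab" can match.
BACKTRACKING = /a+ab/

# With (?>...), whatever a+ consumed is kept, so the trailing "ab"
# never finds an "a" to start from and the match fails quickly.
ATOMIC = /(?>a+)ab/

subject = 'aaab'
p BACKTRACKING.match(subject)  # prints #<MatchData "aaab">
p ATOMIC.match(subject)        # prints nil
```

Because the atomic version can reject input that the original accepted, a rewritten regexp should be checked against known-good documents before it replaces the old one.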

Updated by paranormal dev over 2 years ago

I'm rewriting a large program and am writing a compatibility layer until all the refactoring is done. The regexp is bad because it comes from that program.
