Bug #1740

closed

Ruby regexp causes 100% CPU usage.

Added by paranormal (paranormal dev) over 15 years ago. Updated over 13 years ago.

Status:
Rejected
Assignee:
-
ruby -v:
ruby 1.8.7 (2009-04-08 patchlevel 160) [i386-freebsd6]
[ruby-core:24188]

Description

=begin
I am testing on FreeBSD with
ruby 1.8.7 (2009-04-08 patchlevel 160) [i386-freebsd6]

and on my Linux notebook with
ruby 1.8.7 (2009-06-08 patchlevel 173) [x86_64-linux]

using this code:
#######################################
require 'open-uri'
$KCODE = 'u'

reg = %r{<.?div\sclass\s*=\s*.entry.?>[^<]<.?img\ssrc\s*=\s*.([^"|']).?>[^<]<.?p\sclass\s=\s*.date.?>}im
#del = %r{<(?!p|div|img)[^>]>}i

doc = open('http://www.radiokvit.com.ua/?p=1895').read

#doc.gsub!(del, ' ')
a = doc.match(reg)
p a
######################################

The Ruby process uses 100% CPU for a long time. On Linux it eventually exits normally; on FreeBSD it never exits %-(.
I have also submitted a separate bug report for FreeBSD, http://www.freebsd.org/cgi/query-pr.cgi?pr=136384, but that one is FreeBSD-specific.

This pattern was written by someone else for Perl, and I have to use it here.
=end


Files

testfile.html (23.3 KB): document for testing the regexp, in case the URL is no longer valid. paranormal (paranormal dev), 07/07/2009 11:38 PM
Actions #1

Updated by rue (Eero Saynatkari) over 15 years ago

=begin
Excerpts from rubymine message of Tue Jul 07 17:38:10 +0300 2009:

reg = %r{<.?div\sclass\s*=\s*.entry.?>[^<]<.?img\ssrc\s*=\s*.([^"|']).?>[^<]<.?p\sclass\s=\s*.date.?>}im
#del = %r{<(?!p|div|img)[^>]>}i

The Ruby process uses 100% CPU for a long time. On Linux it eventually exits normally; on FreeBSD it never exits %-(.
I have also submitted a separate bug report for FreeBSD, http://www.freebsd.org/cgi/query-pr.cgi?pr=136384, but that one is FreeBSD-specific.

This pattern was written by someone else for Perl, and I have to use it here.

Firstly, Ruby regexps are not PCRE, so you must have some
leeway constructing the regexp. You cannot (necessarily)
just drop the Perl version in and expect it to work, or
work the same.

Secondly, you should be using something like Nokogiri or
hpricot rather than "parsing" the HTML yourself. For example
your div matcher will fail if the attribute is quoted.

Thirdly, it has "pathological" written all over it. You
should refactor the regexp to try to get some small case
that is reproducible to illustrate the actual problem to
see if it is something that should be fixed.

I am pretty sure there was another thread about really bad
regexp performance in a pathological case a while back, if
you want to search the archives.

Eero

Magic is insufficiently advanced technology.

=end
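
The reduction to a small reproducible case suggested above could look like the following sketch. The pattern and the input strings are illustrative assumptions, not taken from the reported regexp: nested quantifiers applied to an input that can never match are a common way to provoke heavy backtracking. Newer Ruby versions may optimize such patterns, so the timings may not grow on every interpreter.

```ruby
require 'benchmark'

# Hypothetical minimal case: the nested quantifiers let the engine split
# a run of "a" characters in many different ways before giving up.
PATHOLOGICAL = /\A(a+)+\z/

# Returns [match_result, elapsed_seconds] for an input that cannot match.
def time_failed_match(n)
  input = 'a' * n + 'b' # the trailing "b" guarantees the match fails
  result = nil
  seconds = Benchmark.realtime { result = PATHOLOGICAL.match(input) }
  [result, seconds]
end

[10, 14, 18].each do |n|
  result, seconds = time_failed_match(n)
  printf("n=%d matched=%s time=%.4fs\n", n, !result.nil?, seconds)
end
```

If the time grows sharply as n increases, the slowdown comes from backtracking in the pattern rather than from the platform.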

Actions #2

Updated by nobu (Nobuyoshi Nakada) over 15 years ago

  • ruby -v changed from ruby 1.8.7 (2009-06-08 patchlevel 173) [x86_64-linux] to ruby 1.8.7 (2009-04-08 patchlevel 160) [i386-freebsd6]

=begin

=end

Actions #3

Updated by nobu (Nobuyoshi Nakada) over 15 years ago

  • Status changed from Open to Rejected

=begin
Too much backtracking consumes a lot of time.
You can use an atomic group, (?>...), to suppress backtracking:
reg = %r{(?><div\sclass\s=\s*.entry.?>.?<img\b[^<>]\s+src\s=\s*.([^\"|\']).?>).?<p\sclass\s*=\s*.date.*?>}im

=end
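
As a small illustration of what (?>...) changes, here is a sketch with made-up patterns and strings (they are assumptions for demonstration, not part of the reported regexp). Once an atomic group has matched, the engine does not go back into it to try other alternatives or shorter repetitions.

```ruby
# Helper (hypothetical name) that reduces a match to true/false.
def matches?(regexp, string)
  !regexp.match(string).nil?
end

PLAIN  = /a(bc|b)c/    # ordinary group: may fall back from "bc" to "b"
ATOMIC = /a(?>bc|b)c/  # atomic group: keeps "bc" once it has matched

p matches?(PLAIN,  'abc')   # true:  the group backtracks to "b", then "c" matches
p matches?(ATOMIC, 'abc')   # false: "bc" is kept, so no "c" is left to match
p matches?(ATOMIC, 'abcc')  # true:  "bc" is kept and the final "c" still matches

# The same idea applied to a repeated run: the atomic group takes all the
# "a" characters at once and never retries shorter splits.
ATOMIC_RUN = /\A(?>a+)+\z/
p matches?(ATOMIC_RUN, 'a' * 30 + 'b') # false, and it fails quickly
p matches?(ATOMIC_RUN, 'aaaa')         # true
```

The trade-off is that an atomic group can reject strings the plain group would accept, as the second line shows, so it has to be placed where giving up backtracking does not change the intended matches.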

Actions #4

Updated by paranormal (paranormal dev) over 15 years ago

=begin
I am rewriting a large program and am writing a compatibility layer until all the refactoring is done. This regexp is bad because it comes from that program.
=end
