Bug #14418

closed

ruby 2.5 slow regexp execution

Added by jakub.wozny (Kuba W) almost 7 years ago. Updated over 1 year ago.

Status: Closed
Assignee: -
Target version: -
[ruby-core:85219]
Tags:

Description

I have a simple regexp that runs very slowly.

"fußball "*20 =~ /^([\S\s]{1000})/i

It runs fast if I remove the /i flag. I found that it also depends on the string length and on the quantifier value (in this case {1000}).
When you remove ß from the string, it also runs fast.

I tested on 2.3.1, 2.4.3 and 2.5.0.

I'm not sure whether this is a bug or it just works that way.

Updated by jakub.wozny (Kuba W) almost 7 years ago

I can't paste the code here correctly. I created a gist with the regexp: https://gist.github.com/kubaw/60ca998200d80883156fa94efa7eb6fe

Updated by sos4nt (Stefan Schüßler) almost 7 years ago

I can't paste the code here correctly.

You have to insert a blank line before ~~~

Updated by shevegen (Robert A. Heiler) almost 7 years ago

You have to insert a blank line before

I also often just insert four ' ' space characters before the code
I want to add; no idea if it is correctly interpreted but it seems
to work on both github and ruby-lang.org, so I tend to use it. :D

As for the regexp performance, I have no idea whether it is a bug or
not, but either way it may be helpful to have some test code that
runs different regexps and compares the timings with the "expected
speed outcome". That way, people could check before reporting a
(potential) issue whether everything works as intended or some kind
of bug exists.
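A minimal sketch of what such test code could look like, comparing the same pattern with and without /i over growing inputs. The helper names `time_match` and `compare_case_insensitive` are hypothetical and not part of Ruby; the input range is kept small so the script also finishes quickly on affected Ruby versions.

```ruby
require "benchmark"

# Hypothetical helper: wall-clock seconds for one match attempt.
def time_match(pattern, subject)
  Benchmark.realtime { pattern.match?(subject) }
end

# Hypothetical helper: build the pattern with and without /i and
# time both against `unit` repeated n times, for each n in `counts`.
def compare_case_insensitive(source, unit, counts)
  plain = Regexp.new(source)
  icase = Regexp.new(source, Regexp::IGNORECASE)
  counts.map do |n|
    subject = unit * n
    { n: n, plain: time_match(plain, subject), icase: time_match(icase, subject) }
  end
end

# Pattern and sample text taken from this report.
rows = compare_case_insensitive('^([\S\s]{1000})', "fußball ", 0..6)
rows.each do |row|
  puts format("n=%-2d plain=%.6fs icase=%.6fs", row[:n], row[:plain], row[:icase])
end
```

If the /i column grows much faster than the plain column as n increases, that points at the behaviour described in this report rather than at the input size alone.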

Since you use "ß", let me ask you: what encoding do you use within
the script? Possibly UTF-8? Have you tested whether some ISO encoding
makes a difference with regard to speed?

Reason I ask mostly is because I assume you output german text and
the german umlauts are one huge reason for me to prefer ISO encoding
(due to it being simpler for me to handle with it in a project, as
opposed to Unicode variants).

Updated by jakub.wozny (Kuba W) almost 7 years ago

OK, below is the regexp that I tested. I used UTF-8 encoding at the beginning:

"fußball "*20 =~ /([\S\s]{1000})/i

Some measurements:

 require 'benchmark'
 (0..20).each { |n| puts Benchmark.measure { "fußball "*n =~ /^([\S\s]{1000})/i } }
  0.000000   0.000000   0.000000 (  0.000481)
  0.000000   0.000000   0.000000 (  0.000079)
  0.000000   0.000000   0.000000 (  0.000246)
  0.000000   0.000000   0.000000 (  0.000751)
  0.010000   0.000000   0.010000 (  0.002447)
  0.000000   0.000000   0.000000 (  0.006554)
  0.010000   0.000000   0.010000 (  0.007416)
  0.020000   0.000000   0.020000 (  0.022623)
  0.070000   0.000000   0.070000 (  0.066888)
  0.200000   0.000000   0.200000 (  0.196393)
  0.590000   0.000000   0.590000 (  0.591980)
  1.770000   0.000000   1.770000 (  1.772828)
  5.290000   0.010000   5.300000 (  5.292948)
 15.860000   0.000000  15.860000 ( 15.868370)

I would expect this code to run as fast as the version without the /i flag.

"fußball "*20 =~ /([\S\s]{1000})/

(0..20).each { |n| puts Benchmark.measure { "fußball "*n =~ /^([\S\s]{1000})/ } }
  0.000000   0.000000   0.000000 (  0.000036)
  0.000000   0.000000   0.000000 (  0.000009)
  0.000000   0.000000   0.000000 (  0.000011)
  0.000000   0.000000   0.000000 (  0.000016)
  0.000000   0.000000   0.000000 (  0.000018)
  0.000000   0.000000   0.000000 (  0.000029)
  0.000000   0.000000   0.000000 (  0.000020)
  0.000000   0.000000   0.000000 (  0.000021)
  0.000000   0.000000   0.000000 (  0.000023)
  0.000000   0.000000   0.000000 (  0.000024)
  0.000000   0.000000   0.000000 (  0.000016)
  0.000000   0.000000   0.000000 (  0.000027)
  0.000000   0.000000   0.000000 (  0.000022)
  0.000000   0.000000   0.000000 (  0.000023)
  0.000000   0.000000   0.000000 (  0.000024)
  0.000000   0.000000   0.000000 (  0.000023)
  0.000000   0.000000   0.000000 (  0.000024)
  0.000000   0.000000   0.000000 (  0.000026)
  0.000000   0.000000   0.000000 (  0.000025)
  0.000000   0.000000   0.000000 (  0.000026)
  0.000000   0.000000   0.000000 (  0.000053)

Other test cases:

Benchmark.measure { "ß "*20 =~ /^([\S\s]{20})/i } # 0.000000   0.000000   0.000000 (  0.000431)
Benchmark.measure { "ß "*20 =~ /^([\S\s]{30})/i } # 0.000000   0.000000   0.000000 (  0.000427)
Benchmark.measure { "ß "*20 =~ /^([\S\s]{40})/i } # 0.000000   0.000000   0.000000 (  0.000430)
Benchmark.measure { "ß "*20 =~ /^([\S\s]{50})/i } # too long to wait

#without /i flag:
Benchmark.measure { "ß "*20 =~ /^([\S\s]{50})/ } #0.000000   0.000000   0.000000 (  0.000043)

I tested in other encodings:

Benchmark.measure{("fußball ".encode("ISO-8859-1"))*20 =~ /([\S\s]{1000})/i}.to_s # => "  3.450000   0.000000   3.450000 (  3.452036)\n"

With the other encoding, removing /i also speeds it up:

Benchmark.measure{("fußball ".encode("ISO-8859-1"))*20 =~ /([\S\s]{1000})/}.to_s #=> "  0.010000   0.000000   0.010000 (  0.000514)\n"

Reason I ask mostly is because I assume you output german text and
the german umlauts are one huge reason for me to prefer ISO encoding
(due to it being simpler for me to handle with it in a project, as
opposed to Unicode variants).

I have a multilingual app, so I need to stay in Unicode.

Updated by nobu (Nobuyoshi Nakada) almost 7 years ago

  • Description updated (diff)

FYI, you can avoid it by using . instead of [\S\s].
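One caveat that may matter when applying this: in Ruby, `.` does not match a newline unless the /m flag is set, whereas [\S\s] always does. A small sketch of the difference follows; the sample string and the {10} quantifier are chosen only for illustration and are not from the report.

```ruby
# Three lines of 7 characters each, separated by newlines.
subject = "fußball\n" * 3

class_pattern = /^([\S\s]{10})/i  # original style: the class matches any character
dot_multiline = /^(.{10})/mi      # . with /m also matches newlines
dot_plain     = /^(.{10})/i       # . without /m stops at a newline

class_match     = class_pattern.match(subject)
dot_m_match     = dot_multiline.match(subject)
dot_plain_match = dot_plain.match(subject)

p class_match && class_match[1]
p dot_m_match && dot_m_match[1]
p dot_plain_match
```

So a drop-in replacement for [\S\s] is `.` together with /m; without /m the pattern only matches if a single line is long enough.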

Updated by duerst (Martin Dürst) almost 7 years ago

What happens essentially when using //i is that every 'ß' in the string (and in the regular expression) is expanded to 'ss', dynamically. For [\S\s], this wouldn't be necessary. But all character classes are internally treated the same way, so it still happens.
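The folding described above can be observed from Ruby itself. This is only an illustrative sketch: `String#downcase(:fold)` (Ruby 2.4 and later) applies full Unicode case folding, and the two regexp checks print whether the engine treats "ß" and "ss" as equivalent under /i on the Ruby being used. The variable names are made up for this example.

```ruby
# Full Unicode case folding turns the single character ß into two characters.
folded = "ß".downcase(:fold)
puts "folded: #{folded.inspect} (#{folded.length} characters)"

# Under /i the regexp engine compares case-folded forms, so these
# checks show whether "ß" and "ss" are considered equivalent.
eszett_vs_ss = /ss/i.match?("ß")
ss_vs_eszett = /ß/i.match?("ss")
puts "/ss/i matches \"ß\": #{eszett_vs_ss}"
puts "/ß/i matches \"ss\": #{ss_vs_eszett}"
```

Because one character can fold to two, the matcher has more candidate ways to consume the input under /i than without it.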

Updated by hsbt (Hiroshi SHIBATA) almost 5 years ago

  • Tags set to regexp, perf

Updated by jeremyevans0 (Jeremy Evans) over 1 year ago

  • Status changed from Open to Closed

Thanks to very impressive work by @makenowjust, this issue has been fixed in Ruby 3.2.
