Bug #14418

ruby 2.5 slow regexp execution

Added by jakub.wozny (Kuba W) 20 days ago. Updated 19 days ago.

Status:
Open
Priority:
Normal
Assignee:
-
Target version:
-
[ruby-core:85219]

Description

I have a simple regexp that runs very slowly:

"fußball "*20 =~ /^([\S\s]{1000})/i

It runs fast if I remove the /i flag. I found that the slowdown also depends on the string length and on the quantifier value (in this case {1000}).
When you remove ß from the string, it also runs fast.

I tested on 2.3.1, 2.4.3 and 2.5.0.

I'm not sure whether it is a bug or it just works that way.

History

#1 [ruby-core:85222] Updated by jakub.wozny (Kuba W) 20 days ago

I can't paste the code here correctly. I created a gist with the regexp: https://gist.github.com/kubaw/60ca998200d80883156fa94efa7eb6fe

#2 [ruby-core:85228] Updated by sos4nt (Stefan Schüßler) 20 days ago

I can't paste the code here correctly.

You have to insert a blank line before ~~~

#3 [ruby-core:85229] Updated by shevegen (Robert A. Heiler) 20 days ago

You have to insert a blank line before

I also often just insert four ' ' space characters before the code
I want to add; no idea if it is correctly interpreted but it seems
to work on both github and ruby-lang.org, so I tend to use it. :D

As for the regexp performance, I have no idea if it is a bug or not,
but either way, it may be helpful to have some test code that can
run different regexps and compare the timings with the "expected
speed outcome". That way people could check, before they report a
(potential) issue, whether everything works as intended or some
kind of bug exists.
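A minimal sketch of such test code, assuming nothing beyond the standard `benchmark` library. The helper name `match_timings` is hypothetical, and the pattern used here is a fast one so the sketch finishes instantly; it is not an existing tool.

```ruby
require "benchmark"

# Hypothetical helper: time one match of `pattern` against `subject * n`
# for every n in `counts`, returning [n, seconds] pairs.
def match_timings(subject, pattern, counts)
  counts.map do |n|
    input = subject * n
    [n, Benchmark.realtime { input =~ pattern }]
  end
end

# A pattern that is not expected to blow up; swap in the pattern under
# suspicion and keep the counts small to see how the timings grow.
timings = match_timings("fußball ", /^(.{1000})/im, 0..5)
timings.each { |n, secs| puts format("n=%2d  %.6fs", n, secs) }
```

If the timings roughly triple with each step of n, as in the measurements reported in this issue, that suggests exponential backtracking rather than ordinary linear cost.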

Since you use "ß", let me ask you - what encoding do you use within
the script? Possibly UTF-8? Have you tested if some ISO-encoding
makes a difference in regards to speed?

The main reason I ask is that I assume you output German text, and
the German umlauts are one huge reason for me to prefer ISO encoding
(it is simpler for me to handle in a project than the Unicode
variants).

#4 [ruby-core:85232] Updated by jakub.wozny (Kuba W) 20 days ago

OK, below is the regexp that I tested. I used UTF-8 encoding at the beginning:

"fußball "*20 =~ /([\S\s]{1000})/i

Some measurements:

 require 'benchmark'
 (0..20).each { |n| puts Benchmark.measure { "fußball "*n =~ /^([\S\s]{1000})/i } }
  0.000000   0.000000   0.000000 (  0.000481)
  0.000000   0.000000   0.000000 (  0.000079)
  0.000000   0.000000   0.000000 (  0.000246)
  0.000000   0.000000   0.000000 (  0.000751)
  0.010000   0.000000   0.010000 (  0.002447)
  0.000000   0.000000   0.000000 (  0.006554)
  0.010000   0.000000   0.010000 (  0.007416)
  0.020000   0.000000   0.020000 (  0.022623)
  0.070000   0.000000   0.070000 (  0.066888)
  0.200000   0.000000   0.200000 (  0.196393)
  0.590000   0.000000   0.590000 (  0.591980)
  1.770000   0.000000   1.770000 (  1.772828)
  5.290000   0.010000   5.300000 (  5.292948)
 15.860000   0.000000  15.860000 ( 15.868370)

I would expect this code to run as fast as the version without the /i flag.

"fußball "*20 =~ /([\S\s]{1000})/

(0..20).each { |n| puts Benchmark.measure { "fußball "*n =~ /^([\S\s]{1000})/ } }
  0.000000   0.000000   0.000000 (  0.000036)
  0.000000   0.000000   0.000000 (  0.000009)
  0.000000   0.000000   0.000000 (  0.000011)
  0.000000   0.000000   0.000000 (  0.000016)
  0.000000   0.000000   0.000000 (  0.000018)
  0.000000   0.000000   0.000000 (  0.000029)
  0.000000   0.000000   0.000000 (  0.000020)
  0.000000   0.000000   0.000000 (  0.000021)
  0.000000   0.000000   0.000000 (  0.000023)
  0.000000   0.000000   0.000000 (  0.000024)
  0.000000   0.000000   0.000000 (  0.000016)
  0.000000   0.000000   0.000000 (  0.000027)
  0.000000   0.000000   0.000000 (  0.000022)
  0.000000   0.000000   0.000000 (  0.000023)
  0.000000   0.000000   0.000000 (  0.000024)
  0.000000   0.000000   0.000000 (  0.000023)
  0.000000   0.000000   0.000000 (  0.000024)
  0.000000   0.000000   0.000000 (  0.000026)
  0.000000   0.000000   0.000000 (  0.000025)
  0.000000   0.000000   0.000000 (  0.000026)
  0.000000   0.000000   0.000000 (  0.000053)

More test cases:

Benchmark.measure { "ß "*20 =~ /^([\S\s]{20})/i } # 0.000000   0.000000   0.000000 (  0.000431)
Benchmark.measure { "ß "*20 =~ /^([\S\s]{30})/i } # 0.000000   0.000000   0.000000 (  0.000427)
Benchmark.measure { "ß "*20 =~ /^([\S\s]{40})/i } # 0.000000   0.000000   0.000000 (  0.000430)
Benchmark.measure { "ß "*20 =~ /^([\S\s]{50})/i } # too long to wait

#without /i flag:
Benchmark.measure { "ß "*20 =~ /^([\S\s]{50})/ } #0.000000   0.000000   0.000000 (  0.000043)

I tested in other encodings:

Benchmark.measure{("fußball ".encode("ISO-8859-1"))*20 =~ /([\S\s]{1000})/i}.to_s # => "  3.450000   0.000000   3.450000 (  3.452036)\n"

With the other encoding, removing /i also speeds it up:

Benchmark.measure{("fußball ".encode("ISO-8859-1"))*20 =~ /([\S\s]{1000})/}.to_s #=> "  0.010000   0.000000   0.010000 (  0.000514)\n"

The main reason I ask is that I assume you output German text, and
the German umlauts are one huge reason for me to prefer ISO encoding
(it is simpler for me to handle in a project than the Unicode
variants).

I have a multilingual app, so I need to stay with Unicode.

#5 [ruby-core:85245] Updated by nobu (Nobuyoshi Nakada) 19 days ago

  • Description updated (diff)

FYI, you can avoid it by using . instead of [\S\s].
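A small sketch of that workaround. One caveat that is easy to miss: unlike [\S\s], a plain . does not match a newline, so the /m flag is needed to keep the "any character" meaning. The input string here is made up for illustration.

```ruby
# Hypothetical input that contains newlines (8 characters per repetition).
text = "fußball\n" * 20

# Without /m, . stops at "\n", so 100 consecutive characters never match.
without_m = text =~ /^(.{100})/i

# With /m, . matches any character including "\n", like [\S\s] does.
with_m = text =~ /^(.{100})/im
captured = text[/^(.{100})/im, 1]

puts without_m.inspect
puts with_m.inspect
puts captured.length
```

For input that never contains newlines, /^(.{1000})/i alone should behave like the original pattern.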

#6 [ruby-core:85248] Updated by duerst (Martin Dürst) 19 days ago

What happens essentially when using //i is that every 'ß' in the string (and in the regular expression) is expanded to 'ss', dynamically. For [\S\s], this wouldn't be necessary. But all character classes are internally treated the same way, so it still happens.
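The expansion of 'ß' to 'ss' can be observed at the String level with Unicode case folding (Ruby 2.4 or later). This is only an illustration of the folding rule itself, not of how the regexp engine applies it internally.

```ruby
# Unicode full case folding turns the single character "ß" into two
# characters, which is why case-insensitive matching cannot treat it
# as a simple one-to-one mapping.
folded = "ß".downcase(:fold)
puts folded
puts folded.length

# casecmp? compares strings after case folding, so strings of
# different lengths can still compare as equal.
same = "Fußball".casecmp?("FUSSBALL")
puts same
```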
