Bug #14418
Closed
ruby 2.5 slow regexp execution
Added by jakub.wozny (Kuba W) almost 7 years ago. Updated about 1 year ago.
Description
I have a simple regexp that performs very slowly.
"fußball "*20 =~ /^([\S\s]{1000})/i
It works fast if I remove the /i flag. I figured out that it also depends on the string length and on the quantifier value (in this case {1000}).
When you remove ß from the string, it also works fast.
I tested on 2.3.1, 2.4.3 and 2.5.0.
I'm not sure whether it is a bug or it just works that way.
Updated by jakub.wozny (Kuba W) almost 7 years ago
I can't paste the code here correctly. I created a gist with the regexp: https://gist.github.com/kubaw/60ca998200d80883156fa94efa7eb6fe
Updated by sos4nt (Stefan Schüßler) almost 7 years ago
I can't paste the code here correctly.
You have to insert a blank line before ~~~
Updated by shevegen (Robert A. Heiler) almost 7 years ago
You have to insert a blank line before ~~~
I also often just insert four space characters before the code
I want to add; no idea whether that is interpreted correctly, but it seems
to work on both GitHub and ruby-lang.org, so I tend to use it. :D
As for the regexp performance, I have no idea whether it is a bug or not,
but either way, I think it may be helpful to have some test code
that can run different regexps and compare the timings with the "expected
speed outcome". That way, people could check, before they report a
(potential) issue like this one, whether everything works as intended
or some kind of bug exists.
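A test harness of that kind could look roughly like the sketch below. It is only an illustration: `match_timings` and `first_slow_count` are hypothetical helper names, not part of Ruby or of this report, and the pattern used in the demo run is a harmless one chosen so that the sketch finishes quickly on any Ruby version.

```ruby
require "benchmark"

# Hypothetical helper (not part of Ruby): times one match attempt of
# regexp against unit repeated n times, for each n in counts.
def match_timings(regexp, unit, counts)
  counts.map do |n|
    subject = unit * n
    seconds = Benchmark.realtime { regexp.match?(subject) }
    [n, seconds]
  end
end

# Hypothetical helper: returns the first repetition count whose timing
# exceeds limit seconds, or nil if every attempt stayed below it.
def first_slow_count(timings, limit)
  pair = timings.find { |_n, seconds| seconds > limit }
  pair && pair.first
end

# Demo run with a harmless pattern; swap in the pattern under suspicion.
timings = match_timings(/\A\w+ /, "fussball ", 0..5)
timings.each { |n, seconds| puts format("%2d  %.6f s", n, seconds) }
puts "first slow count: #{first_slow_count(timings, 1.0).inspect}"
```

Growing the input step by step, as the reporter's own measurements do, makes a super-linear blow-up visible before a single attempt takes minutes.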
Since you use "ß", let me ask you: what encoding do you use within
the script? Possibly UTF-8? Have you tested whether an ISO encoding
makes a difference with regard to speed?
The main reason I ask is that I assume you output German text, and
the German umlauts are one big reason for me to prefer ISO encoding
(since it is simpler for me to handle in a project than the
Unicode variants).
Updated by jakub.wozny (Kuba W) almost 7 years ago
OK, below is the regexp that I tested. I used UTF-8 encoding at the beginning:
"fußball "*20 =~ /([\S\s]{1000})/i
Some measurements:
(0..20).each { |n| puts Benchmark.measure { "fußball "*n =~ /^([\S\s]{1000})/i } }
0.000000 0.000000 0.000000 ( 0.000481)
0.000000 0.000000 0.000000 ( 0.000079)
0.000000 0.000000 0.000000 ( 0.000246)
0.000000 0.000000 0.000000 ( 0.000751)
0.010000 0.000000 0.010000 ( 0.002447)
0.000000 0.000000 0.000000 ( 0.006554)
0.010000 0.000000 0.010000 ( 0.007416)
0.020000 0.000000 0.020000 ( 0.022623)
0.070000 0.000000 0.070000 ( 0.066888)
0.200000 0.000000 0.200000 ( 0.196393)
0.590000 0.000000 0.590000 ( 0.591980)
1.770000 0.000000 1.770000 ( 1.772828)
5.290000 0.010000 5.300000 ( 5.292948)
15.860000 0.000000 15.860000 ( 15.868370)
I would expect this code to run as fast as the version without the /i flag.
"fußball "*20 =~ /([\S\s]{1000})/
(0..20).each { |n| puts Benchmark.measure { "fußball "*n =~ /^([\S\s]{1000})/ } }
0.000000 0.000000 0.000000 ( 0.000036)
0.000000 0.000000 0.000000 ( 0.000009)
0.000000 0.000000 0.000000 ( 0.000011)
0.000000 0.000000 0.000000 ( 0.000016)
0.000000 0.000000 0.000000 ( 0.000018)
0.000000 0.000000 0.000000 ( 0.000029)
0.000000 0.000000 0.000000 ( 0.000020)
0.000000 0.000000 0.000000 ( 0.000021)
0.000000 0.000000 0.000000 ( 0.000023)
0.000000 0.000000 0.000000 ( 0.000024)
0.000000 0.000000 0.000000 ( 0.000016)
0.000000 0.000000 0.000000 ( 0.000027)
0.000000 0.000000 0.000000 ( 0.000022)
0.000000 0.000000 0.000000 ( 0.000023)
0.000000 0.000000 0.000000 ( 0.000024)
0.000000 0.000000 0.000000 ( 0.000023)
0.000000 0.000000 0.000000 ( 0.000024)
0.000000 0.000000 0.000000 ( 0.000026)
0.000000 0.000000 0.000000 ( 0.000025)
0.000000 0.000000 0.000000 ( 0.000026)
0.000000 0.000000 0.000000 ( 0.000053)
Other test cases:
Benchmark.measure { "ß "*20 =~ /^([\S\s]{20})/i } # 0.000000 0.000000 0.000000 ( 0.000431)
Benchmark.measure { "ß "*20 =~ /^([\S\s]{30})/i } # 0.000000 0.000000 0.000000 ( 0.000427)
Benchmark.measure { "ß "*20 =~ /^([\S\s]{40})/i } # 0.000000 0.000000 0.000000 ( 0.000430)
Benchmark.measure { "ß "*20 =~ /^([\S\s]{50})/i } # too long to wait
# without /i flag:
Benchmark.measure { "ß "*20 =~ /^([\S\s]{50})/ } # 0.000000 0.000000 0.000000 ( 0.000043)
I tested in other encodings:
Benchmark.measure{("fußball ".encode("ISO-8859-1"))*20 =~ /([\S\s]{1000})/i}.to_s # => " 3.450000 0.000000 3.450000 ( 3.452036)\n"
With the other encoding, removing /i also speeds it up:
Benchmark.measure{("fußball ".encode("ISO-8859-1"))*20 =~ /([\S\s]{1000})/}.to_s #=> " 0.010000 0.000000 0.010000 ( 0.000514)\n"
The main reason I ask is that I assume you output German text, and
the German umlauts are one big reason for me to prefer ISO encoding
(since it is simpler for me to handle in a project than the
Unicode variants).
I have a multilingual app, so I need to stay in Unicode.
Updated by nobu (Nobuyoshi Nakada) almost 7 years ago
- Description updated (diff)
FYI, you can avoid it by using . instead of [\S\s].
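One detail worth keeping in mind when applying this workaround: in Ruby, . does not match a newline unless the /m flag is given, so a drop-in replacement for [\S\s] needs /m as well. A minimal sketch, where `leading_chars` is a hypothetical helper name introduced only for illustration:

```ruby
# Hypothetical helper: returns the first n characters of text, or nil when
# the text is shorter than n. The /m flag lets "." match newlines too,
# which is what [\S\s] does in the original pattern.
def leading_chars(text, n)
  match = /^(.{#{n}})/mi.match(text)
  match && match[1]
end

text = "fußball " * 20          # 160 characters
p leading_chars(text, 10)       # => "fußball fu"
p leading_chars(text, 1000)     # => nil
p leading_chars("a\nb", 3)      # => "a\nb"
```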
Updated by duerst (Martin Dürst) almost 7 years ago
What happens essentially when using //i is that every 'ß' in the string (and in the regular expression) is expanded to 'ss', dynamically. For [\S\s], this wouldn't be necessary. But all character classes are internally treated the same way, so it still happens.
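The 'ß' to 'ss' expansion can be observed outside the regexp engine with String#downcase(:fold), which applies Unicode case folding. This is only an analogy to what the engine does internally, not a view into it; the last line prints the result of a case-insensitive match without assuming what a given Ruby version returns.

```ruby
# Unicode case folding maps "ß" to "ss", so the folded string is one
# character longer than the original.
folded = "fußball".downcase(:fold)
p folded                  # => "fussball"
p "fußball".length        # => 7
p folded.length           # => 8

# Under /i, a literal "ss" in the pattern is compared against "ß" in the
# subject; print whatever this Ruby version reports.
p(/fussball/i.match?("fußball"))
```

Because one character can stand for two, the number of ways to line up pattern and subject grows with every 'ß' in the string, which fits the growth seen in the measurements.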
Updated by jeremyevans0 (Jeremy Evans) about 1 year ago
- Status changed from Open to Closed
Thanks to very impressive work by @makenowjust, this issue has been fixed in Ruby 3.2.
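For anyone checking their own patterns: Ruby 3.2 also added Regexp.linear_time?, which reports whether a pattern can be matched in time linear in the input length. A minimal sketch that guards the call, since the method does not exist on older versions; no particular answer is assumed for the pattern from this report.

```ruby
pattern = /^([\S\s]{1000})/i

# Regexp.linear_time? exists only on Ruby 3.2 and later, so guard the call.
# On older versions verdict stays nil.
verdict =
  if Regexp.respond_to?(:linear_time?)
    Regexp.linear_time?(pattern)
  end

puts "Ruby #{RUBY_VERSION}: linear_time? => #{verdict.inspect}"
```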