Bug #14418: ruby 2.5 slow regexp execution - Ruby - Ruby Issue Tracking System

Added by jakub.wozny (Kuba W) over 7 years ago. Updated almost 2 years ago.

Status: Closed
Assignee:
Target version:
ruby -v: 2.5
Backport: 2.3: UNKNOWN, 2.4: UNKNOWN, 2.5: UNKNOWN
Tags: regexp, perf
[ruby-core:85219]

Description

I have a simple regexp that performs very slowly.

"fußball "*20 =~ /^([\S\s]{1000})/i

It runs fast if I remove the /i flag. I found that it also depends on the string length and on the quantifier value (in this case {1000}).
When you remove ß from the string, it also runs fast.

I tested on 2.3.1, 2.4.3 and 2.5.0.

I'm not sure whether it is a bug or whether it just works that way.

History

#1 [ruby-core:85222]

Updated by jakub.wozny (Kuba W) over 7 years ago

I can't paste the code here correctly. I created a gist with the regexp: https://gist.github.com/kubaw/60ca998200d80883156fa94efa7eb6fe

#2 [ruby-core:85228]

Updated by sos4nt (Stefan Schüßler) over 7 years ago

I can't paste the code here correctly.

You have to insert a blank line before ~~~

#3 [ruby-core:85229]

Updated by shevegen (Robert A. Heiler) over 7 years ago

You have to insert a blank line before ~~~

I also often just insert four ' ' space characters before the code
I want to add; no idea if it is correctly interpreted but it seems
to work on both github and ruby-lang.org, so I tend to use it. :D

As for the regexp performance, I have no idea whether it is a bug or not,
but either way it may be helpful to have some test code that runs
different regexps and compares the timings with the "expected speed
outcome". That would let people check, before reporting a (potential)
issue, whether everything works as intended or some kind of bug exists.

Since you use "ß", let me ask you - what encoding do you use within
the script? Possibly UTF-8? Have you tested if some ISO-encoding
makes a difference in regards to speed?

The reason I ask is mostly that I assume you output German text, and
the German umlauts are one big reason for me to prefer ISO encoding
(it is simpler for me to handle in a project than the Unicode
variants).

#4 [ruby-core:85232]

Updated by jakub.wozny (Kuba W) over 7 years ago

OK, below is the regexp that I tested. I used UTF-8 encoding at the beginning:

"fußball "*20 =~ /([\S\s]{1000})/i

Some measurements:

require 'benchmark'

(0..20).each { |n| puts Benchmark.measure { "fußball "*n =~ /^([\S\s]{1000})/i } }
  0.000000   0.000000   0.000000 (  0.000481)
  0.000000   0.000000   0.000000 (  0.000079)
  0.000000   0.000000   0.000000 (  0.000246)
  0.000000   0.000000   0.000000 (  0.000751)
  0.010000   0.000000   0.010000 (  0.002447)
  0.000000   0.000000   0.000000 (  0.006554)
  0.010000   0.000000   0.010000 (  0.007416)
  0.020000   0.000000   0.020000 (  0.022623)
  0.070000   0.000000   0.070000 (  0.066888)
  0.200000   0.000000   0.200000 (  0.196393)
  0.590000   0.000000   0.590000 (  0.591980)
  1.770000   0.000000   1.770000 (  1.772828)
  5.290000   0.010000   5.300000 (  5.292948)
 15.860000   0.000000  15.860000 ( 15.868370)

I would expect this code to run as fast as the version without the /i flag.

"fußball "*20 =~ /([\S\s]{1000})/

(0..20).each { |n| puts Benchmark.measure { "fußball "*n =~ /^([\S\s]{1000})/ } }
  0.000000   0.000000   0.000000 (  0.000036)
  0.000000   0.000000   0.000000 (  0.000009)
  0.000000   0.000000   0.000000 (  0.000011)
  0.000000   0.000000   0.000000 (  0.000016)
  0.000000   0.000000   0.000000 (  0.000018)
  0.000000   0.000000   0.000000 (  0.000029)
  0.000000   0.000000   0.000000 (  0.000020)
  0.000000   0.000000   0.000000 (  0.000021)
  0.000000   0.000000   0.000000 (  0.000023)
  0.000000   0.000000   0.000000 (  0.000024)
  0.000000   0.000000   0.000000 (  0.000016)
  0.000000   0.000000   0.000000 (  0.000027)
  0.000000   0.000000   0.000000 (  0.000022)
  0.000000   0.000000   0.000000 (  0.000023)
  0.000000   0.000000   0.000000 (  0.000024)
  0.000000   0.000000   0.000000 (  0.000023)
  0.000000   0.000000   0.000000 (  0.000024)
  0.000000   0.000000   0.000000 (  0.000026)
  0.000000   0.000000   0.000000 (  0.000025)
  0.000000   0.000000   0.000000 (  0.000026)
  0.000000   0.000000   0.000000 (  0.000053)

Other test cases:

Benchmark.measure { "ß "*20 =~ /^([\S\s]{20})/i } # 0.000000   0.000000   0.000000 (  0.000431)
Benchmark.measure { "ß "*20 =~ /^([\S\s]{30})/i } # 0.000000   0.000000   0.000000 (  0.000427)
Benchmark.measure { "ß "*20 =~ /^([\S\s]{40})/i } # 0.000000   0.000000   0.000000 (  0.000430)
Benchmark.measure { "ß "*20 =~ /^([\S\s]{50})/i } # too long to wait

# without /i flag:
Benchmark.measure { "ß "*20 =~ /^([\S\s]{50})/ } # 0.000000   0.000000   0.000000 (  0.000043)

I tested in other encodings:

Benchmark.measure{("fußball ".encode("ISO-8859-1"))*20 =~ /([\S\s]{1000})/i}.to_s # => "  3.450000   0.000000   3.450000 (  3.452036)\n"

With the other encoding, removing /i also speeds it up:

Benchmark.measure{("fußball ".encode("ISO-8859-1"))*20 =~ /([\S\s]{1000})/}.to_s #=> "  0.010000   0.000000   0.010000 (  0.000514)\n"

The reason I ask is mostly that I assume you output German text, and
the German umlauts are one big reason for me to prefer ISO encoding
(it is simpler for me to handle in a project than the Unicode
variants).

I have a multilingual app, so I need to stay with Unicode.

#5 [ruby-core:85245]

Updated by nobu (Nobuyoshi Nakada) over 7 years ago

Description updated (diff)

FYI, you can avoid it by using . instead of [\S\s].
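
One caveat when applying this workaround: in Ruby, . does not match a newline unless the /m flag is given, so a closer stand-in for [\S\s] is . together with /m. A minimal sketch of the difference, using a short sample string and quantifier chosen only for illustration (the constant names are hypothetical, not from the report):

```ruby
# Sketch: how . and [\S\s] differ around newlines.
# SAMPLE is an illustrative string, not taken from the report.
SAMPLE = "ab\ncd"

CLASS_PATTERN = /\A([\S\s]{5})/    # matches any character, including "\n"
DOT_PATTERN   = /\A(.{5})/         # . stops at "\n" by default
DOT_M_PATTERN = /\A(.{5})/m        # /m lets . match "\n" as well

puts CLASS_PATTERN.match?(SAMPLE)  # true
puts DOT_PATTERN.match?(SAMPLE)    # false
puts DOT_M_PATTERN.match?(SAMPLE)  # true
```

For the case-insensitive pattern in this report, that would presumably mean writing the flags as /mi.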

#6 [ruby-core:85248]

Updated by duerst (Martin Dürst) over 7 years ago

What happens essentially when using //i is that every 'ß' in the string (and in the regular expression) is expanded to 'ss', dynamically. For [\S\s], this wouldn't be necessary. But all character classes are internally treated the same way, so it still happens.
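
The expansion described here is Unicode full case folding. As a rough illustration, String#downcase(:fold) (available since Ruby 2.4) applies the same kind of folding to a string. This is only an analogy for what the regexp engine does internally under /i, not a view into the engine itself; the constant names are hypothetical.

```ruby
# Sketch: full case folding turns the sharp s (U+00DF) into "ss".
# "\u00DF" is written as an escape so the example does not depend on
# the source file's encoding.
WORD   = "fu\u00DFball"
FOLDED = WORD.downcase(:fold)

puts FOLDED          # fussball
puts WORD.length     # 7
puts FOLDED.length   # 8, because one character became two

# Under /i the regexp engine treats the sharp s and "ss" as case variants.
# The result is printed rather than stated here.
puts((/ss/i =~ "\u00DF").inspect)
```

Because folding changes the number of characters, a fixed-count quantifier such as {1000} no longer maps one-to-one onto the characters of the string.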

Updated by hsbt (Hiroshi SHIBATA) over 5 years ago

Tags set to regexp, perf

#8 [ruby-core:114468]

Updated by jeremyevans0 (Jeremy Evans) almost 2 years ago

Status changed from Open to Closed

Thanks to very impressive work by @makenowjust, this issue has been fixed in Ruby 3.2.
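
To check whether a particular Ruby build still shows the slowdown, the measurement from note #4 can be re-run with a small upper bound. The sketch below stops at n = 8, because the timings reported in this thread grow roughly threefold per step on affected versions. The cutoff and the output format are assumptions for illustration.

```ruby
require 'benchmark'

# Sketch: re-run the reported pattern for small repetition counts.
# "\u00DF" is the sharp s from the original report, written as an escape.
puts "Ruby #{RUBY_VERSION}"

TIMINGS = (0..8).map do |n|
  subject = "fu\u00DFball " * n
  # Wall-clock seconds for a single match attempt
  Benchmark.realtime { subject =~ /^([\S\s]{1000})/i }
end

TIMINGS.each_with_index do |seconds, n|
  puts format("n=%2d  %.6f s", n, seconds)
end
```

If the times stay flat as n grows, the build is not affected; steadily multiplying times indicate the behaviour described in this report.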
