Bug #9680

String#sub and siblings should not use regex when String pattern is passed

Added by Sam Rawlins over 1 year ago. Updated 12 months ago.

[ruby-core:61706]
Status:Closed
Priority:Normal
Assignee:-
ruby -v:trunk Backport:2.0.0: WONTFIX, 2.1: WONTFIX

Description

Currently String#sub, #sub!, #gsub, and#gsub!all accept a String pattern, but immediately create a Regexp from it, and use the regex engine to search for the pattern. This is not performant. For example,"123:456".gsub(":", "_")` creates the following objects, most of which are immediately up for GC:

  • dup of the original String
  • result String
  • 2x ":"<US-ASCII>
  • 2x ":"<ASCII-8BIT>
  • Regexp from pattern: /:/
  • #<MatchData ":">
  • #<MatchData nil>

I have a solution which is not too complicated, at https://github.com/ruby/ruby/pull/579 and attached. Calls to rb_reg_search() are replaced with calls to a new function, rb_pat_search(), which conditionally calls rb_reg_search() or rb_str_index(), depending on whether the pattern is a String. Calculating the substring that needs to be replaced is also different when the pattern is a String.

Runtime of each method is dramatically reduced:

require 'benchmark'

n = 4_000_000
Benchmark.bm(7) do |bm|
  str1 = "123:456"; str2 = "123_456";
  colon = ":"; underscore = "_"
  # each benchmark runs the substring method twice so that the bang methods can
  # perform the same number of substitutions to str1 each go around.
  bm.report("sub")   { n.times { str1.sub(colon, underscore);   str2.sub(underscore, colon) } }
  bm.report("sub!")  { n.times { str1.sub!(colon, underscore);  str1.sub!(underscore, colon) } }
  bm.report("gsub")  { n.times { str1.gsub(colon, underscore);  str2.gsub(underscore, colon) } }
  bm.report("gsub!") { n.times { str1.gsub!(colon, underscore); str1.gsub!(underscore, colon) } }
end

# trunk
              user     system      total        real
sub      40.450000   0.580000  41.030000 ( 41.209658)
sub!     39.780000   0.580000  40.360000 ( 40.656789)
gsub     58.500000   0.820000  59.320000 ( 59.603923)
gsub!    59.400000   0.770000  60.170000 ( 60.435687)

# this patch
              user     system      total        real
sub       3.060000   0.010000   3.070000 (  3.091920)
sub!      2.380000   0.010000   2.390000 (  2.390769)
gsub      7.130000   0.130000   7.260000 (  7.299139)
gsub!     7.660000   0.150000   7.810000 (  7.846190)

When using a String pattern, runtime is reduced by 87% to 94%.

There is only one incompatibility that I am aware of: $& will not be set after using a sub method with a String pattern. (Subgroups ($1, ...) will not be available either, but weren't before, since String patterns are escaped before being used.)

In the future, only 3 more methods use the function, get_pat(), that creates a Regexp from the String pattern: #split, #scan, and #match. I think this fix could be applied to these as well.

ruby-579.diff Magnifier (5.12 KB) Sam Rawlins, 03/27/2014 12:09 AM

History

#1 Updated by Eric Wong over 1 year ago

I had an idea to replace the current reg_cache with memoization for
converting string literals, but never got around to it. That would
also reduce garbage while preserving $& compatibility.

#2 Updated by Sam Rawlins over 1 year ago

I think the speedup in this patch comes almost entirely from skipping the regex engine, not from the GC savings.

Preserving $& (and $~ and friends) while still not firing up the regex engine might be possible (constructing the basic backref, with no subgroups, by hand), but very very ugly code (an RMatch has an RRegexp and an rmatch which has a re_registers, etc). This might only a ~20 line function, but feels so dirty...

I think an improvement (or replacement) to reg_cache would also be welcome.

#3 Updated by Benoit Daloze over 1 year ago

It would be interesting to run the benchmark on a more realistic example. One should use String#tr or String#tr! if it is only to replace a single character.

#4 Updated by Sam Rawlins over 1 year ago

Good point, Benoit! This is actually why I started looking into #gsub in the first place. I benchmarked ActiveSupport::Inflector [1], which does operations like gsub!('/', '::') and gsub('::', '/'). Here are the benchmarks, before and after Nobu's patch, based on my patch:

BEFORE
                   user     system      total        real
#underscore   24.000000   0.160000  24.160000 ( 24.254921)
#camelize      3.040000   0.010000   3.050000 (  3.060907)

AFTER
                   user     system      total        real
#underscore   23.690000   0.160000  23.850000 ( 24.012497)
#camelize      2.680000   0.010000   2.690000 (  2.706418)

So #underscore is 1% faster; #camelize is 11% faster.

I also benchmarked Psych with YAML.dump(["one string", "two string"]):

                       user     system      total        real
YAML.dump BEFORE  11.380000   0.250000  11.630000 ( 11.680545)
YAML.dump AFTER   11.030000   0.240000  11.270000 ( 11.313147)

The patch seems to shave 3% off here. Take all of these benchmarks with a grain of salt :)

[1] https://github.com/rails/rails/blob/master/activesupport/lib/active_support/inflector/methods.rb

#5 Updated by Nobuyoshi Nakada over 1 year ago

  • Status changed from Open to Closed

I missed to include this ticket reference in the commit log.

Inflector seems using other replacements with RE, so this improvement might not be significant much.

#6 Updated by Usaku NAKAMURA about 1 year ago

  • Backport changed from 2.0.0: UNKNOWN, 2.1: UNKNOWN to 2.0.0: WONTFIX, 2.1: UNKNOWN

#7 Updated by Tomoyuki Chikanaga 12 months ago

  • Backport changed from 2.0.0: WONTFIX, 2.1: UNKNOWN to 2.0.0: WONTFIX, 2.1: WONTFIX

Also available in: Atom PDF