Bug #9680

String#sub and siblings should not use regex when String pattern is passed

Added by srawlins (Sam Rawlins) about 10 years ago. Updated over 9 years ago.

Status: Closed
Assignee: -
Target version: -
ruby -v: trunk
[ruby-core:61706]

Description

Currently String#sub, #sub!, #gsub, and #gsub! all accept a String pattern, but immediately create a Regexp from it and use the regex engine to search for the pattern. This is not performant. For example, "123:456".gsub(":", "_") creates the following objects, most of which are immediately eligible for GC:

  • dup of the original String
  • result String
  • 2x ":"<US-ASCII>
  • 2x ":"<ASCII-8BIT>
  • Regexp from pattern: /:/
  • #<MatchData ":">
  • #<MatchData nil>
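
One rough way to observe this kind of allocation overhead from Ruby itself is to compare the interpreter's allocation counter before and after a call. The helper name allocations_during is made up for this illustration, and the :total_allocated_objects key is assumed to be available (it is on Ruby 2.2 and later). The count also includes the string literals created inside the block, so treat the number as indicative only; it varies across Ruby versions.

```ruby
# Hypothetical helper: counts objects allocated while the block runs.
# Assumes GC.stat(:total_allocated_objects) exists (Ruby 2.2+).
def allocations_during
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

count = allocations_during { "123:456".gsub(":", "_") }
puts "objects allocated by one gsub call (including literals): #{count}"
```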

I have a solution which is not too complicated, at https://github.com/ruby/ruby/pull/579 and attached. Calls to rb_reg_search() are replaced with calls to a new function, rb_pat_search(), which conditionally calls rb_reg_search() or rb_str_index(), depending on whether the pattern is a String. Calculating the substring that needs to be replaced is also different when the pattern is a String.
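
The dispatch described above can be sketched in plain Ruby. This is only an illustration of the idea, not the C code from the patch: pat_search and plain_sub are hypothetical names, and the sketch inserts the replacement text literally rather than interpreting backreferences the way String#sub does.

```ruby
# Hypothetical Ruby-level analogue of the rb_pat_search() idea:
# use a plain substring search for String patterns and the regex
# engine only for Regexp patterns.
# Returns [start, length] in characters, or nil when nothing matches.
def pat_search(str, pat, pos = 0)
  if pat.is_a?(String)
    start = str.index(pat, pos)          # plain substring search
    start && [start, pat.length]         # matched span is the pattern itself
  else
    m = pat.match(str, pos)              # regex engine
    m && [m.begin(0), m.end(0) - m.begin(0)]
  end
end

# Simplified single substitution built on pat_search.
# Unlike String#sub, the replacement is inserted literally.
def plain_sub(str, pat, replacement)
  found = pat_search(str, pat)
  return str.dup unless found
  start, len = found
  str[0, start] + replacement + str[(start + len)..-1]
end

puts plain_sub("123:456", ":", "_")
puts plain_sub("123:456", /\d+/, "x")
```

Note how the length of the replaced span comes from the pattern when it is a String, and from the match when it is a Regexp, which mirrors the remark that the substring calculation differs between the two cases.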

Runtime of each method is dramatically reduced:

require 'benchmark'

n = 4_000_000
Benchmark.bm(7) do |bm|
  str1 = "123:456"; str2 = "123_456";
  colon = ":"; underscore = "_"
  # each benchmark runs the substitution method twice so that the bang methods
  # can perform the same number of substitutions to str1 on each iteration.
  bm.report("sub")   { n.times { str1.sub(colon, underscore);   str2.sub(underscore, colon) } }
  bm.report("sub!")  { n.times { str1.sub!(colon, underscore);  str1.sub!(underscore, colon) } }
  bm.report("gsub")  { n.times { str1.gsub(colon, underscore);  str2.gsub(underscore, colon) } }
  bm.report("gsub!") { n.times { str1.gsub!(colon, underscore); str1.gsub!(underscore, colon) } }
end

# trunk
              user     system      total        real
sub      40.450000   0.580000  41.030000 ( 41.209658)
sub!     39.780000   0.580000  40.360000 ( 40.656789)
gsub     58.500000   0.820000  59.320000 ( 59.603923)
gsub!    59.400000   0.770000  60.170000 ( 60.435687)

# this patch
              user     system      total        real
sub       3.060000   0.010000   3.070000 (  3.091920)
sub!      2.380000   0.010000   2.390000 (  2.390769)
gsub      7.130000   0.130000   7.260000 (  7.299139)
gsub!     7.660000   0.150000   7.810000 (  7.846190)

When using a String pattern, runtime is reduced by 87% to 94%.
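
The percentages can be re-derived from the "total" column of the two tables above. The snippet below is just that arithmetic; reduction_percent and TOTALS are names chosen for this illustration.

```ruby
# "total" column from the benchmark output above: [trunk, this patch]
TOTALS = {
  "sub"   => [41.03, 3.07],
  "sub!"  => [40.36, 2.39],
  "gsub"  => [59.32, 7.26],
  "gsub!" => [60.17, 7.81]
}

# Percentage by which runtime dropped, rounded to one decimal place.
def reduction_percent(before, after)
  ((1 - after / before) * 100).round(1)
end

TOTALS.each do |name, (before, after)|
  puts format("%-6s %5.1f%%", name, reduction_percent(before, after))
end
```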

There is only one incompatibility that I am aware of: $& will not be set after using a sub method with a String pattern. (Subgroups ($1, ...) will not be available either, but weren't before, since String patterns are escaped before being used.)
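
To check whether code depends on this, a small probe like the following may help. The method name matched_text_after_sub is made up for this sketch. With a Regexp pattern, $& holds the matched text; with a String pattern, the result is exactly what the incompatibility above is about, so no expected value is given for that case.

```ruby
# Hypothetical probe: returns $& as seen right after a sub call.
# $& is local to the method frame, so it is read inside the method.
def matched_text_after_sub(str, pattern, replacement)
  str.sub(pattern, replacement)
  $&
end

p matched_text_after_sub("123:456", /:/, "_")   # Regexp pattern
p matched_text_after_sub("123:456", ":", "_")   # String pattern: depends on the Ruby build

# String patterns are matched literally, so regex metacharacters and
# groups in them have no special meaning:
p "a.c:abc".sub(".", "_")    # replaces the literal dot
p "a.c:abc".sub(/./, "_")    # replaces the first character
```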

Looking ahead, only three other methods use get_pat(), the function that creates a Regexp from a String pattern: #split, #scan, and #match. I think this fix could be applied to these as well.


Files

ruby-579.diff (5.12 KB), srawlins (Sam Rawlins), 03/27/2014 12:09 AM