Project

General

Profile

Bug #17030

Enumerable#grep{_v} should be optimized for Regexp

Added by marcandre (Marc-Andre Lafortune) 25 days ago. Updated 18 days ago.

Status:
Open
Priority:
Normal
Assignee:
-
Target version:
-
[ruby-core:99156]

Description

Currently:

array.select { |e| e.match?(REGEXP) }

# about 3x faster and 6x more memory efficient than
array.grep(REGEXP)

This is because grep calls Regexp#=== which creates useless MatchData

Updated by marcandre (Marc-Andre Lafortune) 25 days ago

Code to reproduce by @fatkodima:

require 'benchmark-ips'
require 'benchmark-memory'

arr = %w[foobar foobaz bazquux hello world just making this array longer]

REGEXP = /o/

def select_match(arr)
  arr.select { |e| e.match?(REGEXP) }
end

def grep(arr)
  arr.grep(REGEXP)
end

Benchmark.ips do |x|
  x.report("select.match?")  { select_match(arr) }
  x.report("grep")           { grep(arr) }
  x.compare!
end

puts "********* MEMORY *********"

Benchmark.memory do |x|
  x.report("select.match?")  { select_match(arr) }
  x.report("grep")           { grep(arr) }
  x.compare!
end

Updated by shyouhei (Shyouhei Urabe) 25 days ago

Yes but E#grep's allocating MatchData is by spec. You can observe $& etc. by passing a block to it.

p %w[q w e r].grep(/./) { $~ }

So this is at least a breaking change.

Updated by marcandre (Marc-Andre Lafortune) 25 days ago

You are right for grep with a block, we can't necessarily optimize, but we should optimize grep without a block, no?

Updated by nobu (Nobuyoshi Nakada) 25 days ago

Even without a block, grep sets $~ to the last match result.

Updated by Eregon (Benoit Daloze) 24 days ago

nobu (Nobuyoshi Nakada) wrote in #note-4:

Even without a block, grep sets $~ to the last match result.

I guess cases using $~ after the call to grep are very rare (notably because only the last match of the Enumerable would be accessible).

So I would suggest not setting $~ for grep without block.

Updated by marcandre (Marc-Andre Lafortune) 24 days ago

nobu (Nobuyoshi Nakada) wrote in #note-4:

Even without a block, grep sets $~ to the last match result.

I agree with Eregon (Benoit Daloze), doesn't seem like it makes much sense to use that.

There's also no valid reason it should set $~ for all the matches tested before the last one.

Updated by scivola20 (sciv ola) 18 days ago

I have an idea to solve it without any compatibility problem.

[1] Introduce such a Regexp object that === method is same as match?.
[2] Introduce regexp literal option that makes the Regexp object as [1].

If the option is 'f', we can write as /o/f, and grep(/o/f) is faster than grep(/o/).

This speed up not only grep but also all?, any?, case and so on.

Many people have written like this:

IO.foreach("foo.txt") do |line|
  case line
  when /^#/
    # do nothing 
  when /^(\d+)/
    # using $1
  when /xxx/
    # using $&
  when /yyy/
    # not using $&
  else
    # ...
  end
end

This is slow because of the above mentioned problem.
Replacing /^#/ with /^#/f, and /yyy/ with /yyy/f will make it faster.

Also available in: Atom PDF