Feature #17206

Introduce new Regexp option to avoid global MatchData allocations

Added by fatkodima (Dima Fatko) 2 months ago. Updated about 1 month ago.

Status: Open
Priority: Normal
Assignee: -
Target version: -
[ruby-core:100239]

Description

Originates from https://bugs.ruby-lang.org/issues/17030

When this option is specified, Ruby will not create global MatchData objects unless the method explicitly needs them.

If the new option is named f, we can write /o/f, and grep(/o/f) is faster than grep(/o/).

This speeds up not only grep, but also all?, any?, case and so on.
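For context, here is a small sketch of the behavior the option targets, using only methods that exist in Ruby today (the proposed /f flag is not used, since it is only a proposal). The helper names are made up for illustration. Each helper resets $~, performs one kind of match, and returns whatever $~ holds afterwards.

```ruby
# Hypothetical helpers, not part of the proposal.
# Each one clears the frame-local $~, runs one kind of match,
# and returns the value of $~ afterwards.

def last_match_after_equal_tilde(pattern, string)
  $~ = nil
  pattern =~ string          # Regexp#=~ sets $~ on success
  $~
end

def last_match_after_case(pattern, string)
  $~ = nil
  case string
  when pattern then :matched # Regexp#=== also sets $~ on success
  end
  $~
end

def last_match_after_match_p(pattern, string)
  $~ = nil
  pattern.match?(string)     # Regexp#match? leaves $~ untouched
  $~
end

p last_match_after_equal_tilde(/o/, "foo").class
p last_match_after_case(/o/, "foo").class
p last_match_after_match_p(/o/, "foo").class
```

The first two helpers return a MatchData and the third returns nil, which is the allocation difference the proposal wants to make available to =~, === and case as well.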

Many people have written code like this:

IO.foreach("foo.txt") do |line|
  case line
  when /^#/
    # do nothing 
  when /^(\d+)/
    # using $1
  when /xxx/
    # using $&
  when /yyy/
    # not using $&
  else
    # ...
  end
end

This is slow because of the problem described above: every successful match creates a global MatchData, even in clauses that never read it.
Replacing /^#/ with /^#/f and /yyy/ with /yyy/f will make it faster.
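As a point of comparison, the same dispatch can be sketched with what Ruby already offers: match? for the branches that do not need $~, and =~ for the ones that do. This is not the proposed /f option, only an equivalent written with if/elsif. The classify method and its return values are hypothetical and exist only to make the example testable.

```ruby
# Hypothetical rewrite of the case statement above using existing APIs.
# Branches that never read $~ use match?; the others keep =~.
def classify(line)
  if line.match?(/^#/)
    :comment                 # no MatchData needed
  elsif line =~ /^(\d+)/
    [:number, $1]            # $1 requires $~ to be set
  elsif line =~ /xxx/
    [:xxx, $&]               # $& requires $~ to be set
  elsif line.match?(/yyy/)
    :yyy                     # no MatchData needed
  else
    :other
  end
end

p classify("# a comment")
p classify("42 apples")
p classify("axxxb")
p classify("has yyy inside")
p classify("nothing here")
```

The trade-off is that the case/when form has to be given up, which is the readability cost the /f option is meant to avoid.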

Benchmarks at https://bugs.ruby-lang.org/issues/17030#note-9 show a 2.5x to 5x speedup.
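For anyone who wants to get a rough feel for the gap locally, a minimal timing sketch follows. It is not the benchmark from the linked note; it only compares =~ against match? on one short string, and the helper names, input, and iteration count are arbitrary assumptions. Results will vary by Ruby version and machine.

```ruby
# Hypothetical micro-benchmark; not the one linked above.
# Times `iterations` calls of the block with the monotonic clock.
def time_it(iterations)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  iterations.times { yield }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

# Returns elapsed seconds for each matching style.
def compare_match_styles(line = "some yyy text", iterations = 200_000)
  pattern = /yyy/
  {
    equal_tilde: time_it(iterations) { pattern =~ line },       # sets $~
    match_p:     time_it(iterations) { pattern.match?(line) }   # does not
  }
end

compare_match_styles.each do |name, seconds|
  puts format("%-12s %.4fs", name, seconds)
end
```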

PR: https://github.com/ruby/ruby/pull/3455

Updated by znz (Kazuhiro NISHIYAMA) 2 months ago

What does regexp_without_matchdata.match(string) return when matched?

Updated by fatkodima (Dima Fatko) 2 months ago

znz (Kazuhiro NISHIYAMA) wrote in #note-1:

What does regexp_without_matchdata.match(string) return when matched?

That's what the "when not explicitly needed by the method" part was about: it returns MatchData in this case, as requested.

Updated by fatkodima (Dima Fatko) 2 months ago

  • Subject changed from Introduce new Regexp option to avoid MatchData allocation to Introduce new Regexp option to avoid global MatchData allocations

Updated by Eregon (Benoit Daloze) 2 months ago

IMHO hardcoding such knowledge in the pattern feels wrong (vs in the matching method like Regexp#match? which is fine).
It seems to me that it could cause confusing bugs, e.g. when using /f in the case above if a when clause starts to use one of the $~-derived variables.
Then it would unexpectedly always be nil, causing a potentially very subtle bug.
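The kind of bug described here can be approximated today with match?, which also skips setting $~. The sketch below is only an analogy under that assumption (how /f itself would behave is defined by the PR, not by this code), and the method names are hypothetical.

```ruby
# Hypothetical illustration using match? as a stand-in for a pattern
# that does not populate $~.

# The branch reads $1, but nothing in this method ever set $~.
def digits_with_match_p(line)
  $~ = nil
  if line.match?(/^(\d+)/)
    $1                        # nil, even though the pattern matched
  end
end

# Same logic with =~, which does populate $~.
def digits_with_equal_tilde(line)
  $~ = nil
  if line =~ /^(\d+)/
    $1
  end
end

# If an earlier match already set $~, the stale capture leaks through.
def stale_capture(line)
  "abc" =~ /(b)/              # sets $1 to "b"
  $1 if line.match?(/^(\d+)/) # still "b", not the digits
end

p digits_with_match_p("42 apples")
p digits_with_equal_tilde("42 apples")
p stale_capture("42 apples")
```

Nothing fails loudly in either buggy variant: the code runs and returns a wrong value.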

I have a hard time believing that allocating the MatchData is so expensive.
If that's the case, then there must be a lot of optimization potential for faster allocation of MatchData in CRuby.
What I think rather is this is due to having to set $~ in the caller, and maybe to compute group offsets.

I think it would be worth investigating in more detail where the performance overhead from $~ & friends comes from in CRuby.

Updated by scivola20 (sciv ola) about 1 month ago

I believe that people who can use match? and match methods properly, can use this new Regexp option properly.

By the way, the total size of $`, $&, $' equals the size of the target string. Therefore a huge amount of String garbage will be generated if the text is very large.

Updated by Eregon (Benoit Daloze) about 1 month ago

scivola20 (sciv ola) wrote in #note-5:

I believe that people who can use match? and match methods properly, can use this new Regexp option properly.

I disagree, match? is clear, I think =~ suddenly not setting $~ would be a frequent source of bugs.

By the way, the total size of $`, $&, $' equals the size of the target string. Therefore a huge amount of String garbage will be generated if the text is very large.

They are all based on $~, aren't they?
I think they only need a copy-on-write copy of the source string (to avoid later mutations affecting them) + the matched offsets.
At least that's what happens in TruffleRuby.
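The relationship between the match, the source string, and the offsets can be seen from Ruby itself. The sketch below uses only MatchData methods (pre_match, post_match, begin, end); the helper names and the sample string are assumptions. It shows observable behavior, not how any implementation stores the data internally.

```ruby
# Hypothetical helpers showing what $`, $& and $' are derived from.

# Returns the three pieces and the offsets of the overall match.
def match_parts(source, pattern)
  m = pattern.match(source)
  return nil unless m
  {
    pre:     m.pre_match,            # same value as $`
    match:   m[0],                   # same value as $&
    post:    m.post_match,           # same value as $'
    offsets: [m.begin(0), m.end(0)]  # start and end of the match
  }
end

# Mutating the source after matching does not change the MatchData.
def post_match_after_mutation
  source = +"foo bar baz"
  m = /bar/.match(source)
  source << "!!!"
  m.post_match
end

parts = match_parts("foo bar baz", /bar/)
p parts
p parts[:pre].size + parts[:match].size + parts[:post].size
p post_match_after_mutation
```

The three pieces add up to the length of the source string, and they can all be recomputed from the string plus the two offsets.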

Updated by Eregon (Benoit Daloze) about 1 month ago

I took a quick look, the logic to set $~ is here:
https://github.com/ruby/ruby/blob/148961adcd0704d964fce920330a6301b9704c25/re.c#L1608-L1623

It does not seem so expensive, but the region is allocated with xmalloc(), which is probably not so cheap (there is also an rb_gc() call in there; hopefully it's not hit in practice).
rb_backref_set() goes through a few indirections (it needs to reach the caller frame typically), but it does not seem too expensive either.
I think it would be valuable to investigate further what's actually expensive about setting $~ and how that can be optimized.

A hacky Regexp flag to manually optimize match/=~/=== calls doesn't seem like a good approach to me.
The caller code knows if it needs $~, etc, not the Regexp literal.

Updated by scivola20 (sciv ola) about 1 month ago

Sorry. “a huge amount of String garbage” was my misunderstanding.

But I don’t know under what situation this option may cause a bug.
