Feature #17206
Open
Introduce new Regexp option to avoid global MatchData allocations
Description
Originates from https://bugs.ruby-lang.org/issues/17030
When this option is specified, Ruby will not create the global MatchData object ($~) unless the method being called explicitly needs it.
If the new option is named f, we can write /o/f, and grep(/o/f) is faster than grep(/o/).
This speeds up not only grep, but also all?, any?, case and so on.
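For comparison, current Ruby already has one matching method that skips the global MatchData: Regexp#match?. The sketch below is only an illustration of that existing behavior, not of the proposed f option (which does not exist in released Ruby). The method names are made up for the example.

```ruby
# $~ is local to each method frame, so it starts out nil in both methods.

def last_match_after_match_p(str)
  /o/.match?(str)  # returns true/false only
  $~               # still nil: match? does not set the global MatchData
end

def last_match_after_equal_tilde(str)
  /o/ =~ str       # returns the match position and sets $~
  $~               # a MatchData for the first "o"
end

p last_match_after_match_p("foo")
p last_match_after_equal_tilde("foo")
```

The proposal is, roughly, to let the pattern itself request the first behavior for =~, ===, grep and similar callers.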
Many people have written code like this:
IO.foreach("foo.txt") do |line|
  case line
  when /^#/
    # do nothing
  when /^(\d+)/
    # using $1
  when /xxx/
    # using $&
  when /yyy/
    # not using $&
  else
    # ...
  end
end
This is slow because of the problem mentioned above: a global MatchData is created even in the when clauses that never use it.
Replacing /^#/ with /^#/f, and /yyy/ with /yyy/f will make it faster.
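Since /f is only a proposal, here is a rough sketch of how the same loop body can be written in current Ruby, using match? for the branches that do not need back-references and =~ for the ones that do. The method name classify and the returned symbols are invented for the example.

```ruby
# Classify one line; only the branches that read $1 or $& use =~.
def classify(line)
  if line.match?(/^#/)       # no back-references needed
    :comment
  elsif line =~ /^(\d+)/     # sets $~, so $1 is available
    [:number, $1]
  elsif line =~ /xxx/        # sets $~, so $& is available
    [:xxx, $&]
  elsif line.match?(/yyy/)   # no back-references needed
    :yyy
  else
    :other
  end
end

p classify("# a comment")
p classify("12 apples")
p classify("axxxb")
```

This gives up the case/when form, which is the readability cost the proposed option is trying to avoid.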
The benchmarks in https://bugs.ruby-lang.org/issues/17030#note-9 show a 2.5x to 5x speedup.
Updated by znz (Kazuhiro NISHIYAMA) about 5 years ago
What does regexp_without_matchdata.match(string) return when matched?
Updated by fatkodima (Dima Fatko) about 5 years ago
znz (Kazuhiro NISHIYAMA) wrote in #note-1:
What does regexp_without_matchdata.match(string) return when matched?
That's what the "when not explicitly needed by the method" part was about: it returns a MatchData in this case, as requested.
Updated by fatkodima (Dima Fatko) about 5 years ago
- Subject changed from Introduce new Regexp option to avoid MatchData allocation to Introduce new Regexp option to avoid global MatchData allocations
Updated by Eregon (Benoit Daloze) about 5 years ago
IMHO hardcoding such knowledge in the pattern feels wrong (vs in the matching method like Regexp#match? which is fine).
It seems to me that it could cause confusing bugs, e.g. when using /f in the case above if a when clause starts to use one of the $~-derived variables.
Then it would unexpectedly always be nil, causing a potentially very subtle bug.
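The failure mode described here can be approximated in current Ruby, assuming match? as a stand-in for a pattern carrying the hypothetical f flag. The method names are invented for the example.

```ruby
# Stand-in for a `when /^(\d+)/f` clause whose body later starts using $1.
def first_number_broken(line)
  if /^(\d+)/.match?(line)  # matches, but does not set $~
    $1                      # nil, even though the pattern matched
  end
end

# The same logic with =~, which does set $~.
def first_number(line)
  if /^(\d+)/ =~ line
    $1
  end
end

p first_number_broken("42 apples")
p first_number("42 apples")
```

No error is raised in the first version; the branch is taken and $1 is silently nil.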
I have a hard time believing that allocating the MatchData is so expensive.
If that's the case, then there must be a lot of optimization potential for faster allocation of MatchData in CRuby.
Rather, I think this is due to having to set $~ in the caller, and maybe to computing group offsets.
I think it would be worth investigating in more detail where the performance overhead from $~ & friends comes from in CRuby.
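A starting point for such an investigation could be a micro-benchmark along these lines. It is only a sketch: the helper time_it, the sample line, and the iteration count are arbitrary choices for the example, and it measures the combined cost of =~ versus match? without separating allocation from setting $~.

```ruby
# Time a block over a number of iterations using the monotonic clock.
def time_it(iterations)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  iterations.times { yield }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

line = "some line with yyy in it"
n = 200_000

with_backref    = time_it(n) { /yyy/ =~ line }        # allocates MatchData, sets $~
without_backref = time_it(n) { /yyy/.match?(line) }   # neither

puts format("=~     : %.4fs", with_backref)
puts format("match? : %.4fs", without_backref)
```

Results will vary by Ruby version and machine; a profiler would be needed to attribute the difference to specific parts of re.c.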
Updated by scivola20 (sciv ola) about 5 years ago
I believe that people who can use the match? and match methods properly can use this new Regexp option properly.
By the way, the total size of $` , $&, $' equals the size of the target string. Therefore a huge amount of String garbage will be generated if the text is very large.
Updated by Eregon (Benoit Daloze) about 5 years ago
scivola20 (sciv ola) wrote in #note-5:
I believe that people who can use match? and match methods properly, can use this new Regexp option properly.
I disagree: match? is clear, but I think =~ suddenly not setting $~ would be a frequent source of bugs.
By the way, the total size of $`, $&, $' equals to the size of the target string. Therefore a huge amount of String garbage will be generated, if the text is very large.
They are all based on $~, aren't they?
I think they only need a copy-on-write copy of the source string (to avoid later mutations affecting them) + the matched offsets.
At least that's what happens in TruffleRuby.
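The relationship between these variables and $~ can be checked from Ruby itself. The sketch below only shows the observable behavior (the special variables agree with the MatchData accessors); it says nothing about how any implementation stores the strings internally. The method name match_parts is invented for the example.

```ruby
# Return the three special variables and check them against $~.
def match_parts(str, pattern)
  return nil unless pattern =~ str
  m = $~
  {
    pre:   $`,   # same as m.pre_match
    match: $&,   # same as m[0]
    post:  $',   # same as m.post_match
    agree: [$`, $&, $'] == [m.pre_match, m[0], m.post_match]
  }
end

parts = match_parts("hello world", /o w/)
p parts
p parts[:pre].size + parts[:match].size + parts[:post].size == "hello world".size
```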
Updated by Eregon (Benoit Daloze) about 5 years ago
I took a quick look, the logic to set $~ is here:
https://github.com/ruby/ruby/blob/148961adcd0704d964fce920330a6301b9704c25/re.c#L1608-L1623
It does not seem so expensive, but the region is allocated with xmalloc(), which is probably not so cheap (there is also a rb_gc() call in there; hopefully it's not hit in practice).
rb_backref_set() goes through a few indirections (it needs to reach the caller frame typically), but it does not seem too expensive either.
I think it would be valuable to investigate further what's actually expensive about setting $~ and how that can be optimized.
A hacky Regexp flag to manually optimize match/=~/=== calls doesn't seem like a good approach to me.
The caller code knows if it needs $~, etc, not the Regexp literal.
Updated by scivola20 (sciv ola) about 5 years ago
Sorry. “a huge amount of String garbage” was my misunderstanding.
But I don’t know in what situation this option could cause a bug.