Misc #20652
openMemory allocation for gsub has increased from Ruby 2.7 to 3.3
Description
I recently upgraded from ruby 2.7.7 to 3.3.1 and noticed that the GC load increased.
When I used the allocation profiler to investigate, I found that memory allocation from gsub had increased.
The problem was code like this:
s = "foo "
s.gsub(/ (\s+)/) { " #{' ' * Regexp.last_match(1).length}" }
When I compared the results of heap-profiler between 2.7.7 and 3.3.1, I found that MatchData was increasing.
https://gist.github.com/orisano/98792dee260106e9b6fcb45bbabeb1e6
https://github.com/ruby/ruby/commit/abc0304cb28cb9dcc3476993bc487884c139fd11
I discovered that the cause is this commit, which stopped reusing backref to avoid race conditions.
Is there a way to reuse backref while still avoiding race conditions?
Updated by mame (Yusuke Endoh) 6 months ago
- Related to Bug #17507: Regexp capture groups ignored sometimes in some multithreaded environments (possible race condition) added
Updated by mame (Yusuke Endoh) 6 months ago
- Assignee set to jeremyevans0 (Jeremy Evans)
@jeremyevans0 (Jeremy Evans) What do you think?
Updated by jeremyevans0 (Jeremy Evans) 6 months ago
Updated by byroot (Jean Boussier) 6 months ago
If anyone can come up with an approach that fixes #17507 without causing an increase in memory usage
I guess I'm a bit surprised by that because I wouldn't have thought the backref is only accessible by the same fiber, but maybe I'm missing something.
But on a more general note, I very very often wish I'd have a way to not have a MatchData
created by methods that take a Regexp, it's very rare you need the MatchData
created by gsub
or Regexp===
. Most of the time it's fine, but when you are trying to optimize a hotspot, it's really something you'd want to be able to skip. It's in big part why Regexp#match?
was added and popularized, but is only usable in a subset of cases.
Maybe we could add a new Regexp flag to turn off this behavior?
e.g.
case str
when /^a/
p $~ # => #<MatchData "a">
when /^b/c
p $~ # => nil
end
Updated by shyouhei (Shyouhei Urabe) 6 months ago
I agree this is counter-intuitive. The #17507 problem was that
i = lambda { ...(touches $~)... }
many.times { Thread.start { many.times { i.call } } }
would break. This is because $~
is i
's local variable. i
is shared across threads so is $~
.
@jeremyevans0 (Jeremy Evans) fixed this so that no two $~
s are identical, no matter multi threaded situation or not.
The fix of course increases memory usage. But because the root cause of the problem is sharing local variables across threads, to unshare them the bloat seems kind of inevitable to me. I also think this is a needed fix. Nobody thinks gsub
modifies $~
behind the scene.
I would agree with @byroot (Jean Boussier) that it could be better if we had a gsub
variant that doesn't touch $~
.
Updated by byroot (Jean Boussier) 6 months ago
it could be better if we had a gsub variant that doesn't touch $~.
Right, but the problem is beyond gsub
, e.g.:
>> "abba"[/(bb|[^b]{2})/]
=> "bb"
>> $~
=> #<MatchData "bb" 1:"bb">
So introducing gsub( backref: false)
or gsub_no_backref
would solve one case but would leave many others.
Hence why I think the only way to handle this general problem without introducing tons of new methods and lots of code churn would be a new Regexp flag.
Updated by Eregon (Benoit Daloze) 6 months ago · Edited
FWIW, what TruffleRuby does for this is to store $~
as a frame-local thread-local variable, but thread-local only if more than 1 thread has been seen, otherwise it's stored directly in the frame:
https://github.com/oracle/truffleruby/blob/3cd422433deebe3fa664f8c4540811c42ca02e93/src/main/java/org/truffleruby/language/threadlocal/ThreadAndFrameLocalStorage.java
I'm not sure how it works on CRuby, but if $~
is stored directly in the frame then threads might see a different $~
than they expect which could lead to very subtle bugs.
I don't really like a Regexp flag for this because a Regexp might be used in different contexts and some usages might want $~
and some might not (which could lead to a bunch of duplication).
I think in general a good fix to simplify this and avoid this kind of races would be to store $~
in the caller frame (even if that's a block's frame) but not higher.
In this case it would be stored in the lambda
's frame and not outside.
That's also quite a bit faster.
Of course it would be somewhat incompatible, but how much code uses a $~
outside a block when the Regexp call is made inside a block?
We could warn that such code should not rely on that for a release or so, before changing it.
Updated by ko1 (Koichi Sasada) 6 months ago
Eregon (Benoit Daloze) wrote in #note-7:
FWIW, what TruffleRuby does for this is to store
$~
as a frame-local thread-local variable, but thread-local only if more than 1 thread has been seen, otherwise it's stored directly in the frame:
off-topic:
what does it happen when creating a thread just after storing $~
?
Updated by byroot (Jean Boussier) 6 months ago
I don't really like a Regexp flag for this because a Regexp might be used in different contexts and some usages might want $~ and some might not (which could lead to a bunch of duplication).
I see what you mean, but such flag would only really be worth using in places where saving that allocation is worth it, where right now you usually use a literal anyway, so I don't think duplication would be a concern.
Of course you could expect Rubocop and such to try to force it on everyone because it's faster™, but...
Updated by austin (Austin Ziegler) 6 months ago
byroot (Jean Boussier) wrote in #note-9:
I don't really like a Regexp flag for this because a Regexp might be used in different contexts and some usages might want $~ and some might not (which could lead to a bunch of duplication).
I see what you mean, but such flag would only really be worth using in places where saving that allocation is worth it, where right now you usually use a literal anyway, so I don't think duplication would be a concern.
Of course you could expect Rubocop and such to try to force it on everyone because it's faster™, but...
It could be a flag only accessible through Regexp.new
(Regexp.new(source, Regexp::OPTIMIZED_MATCHDATA)
) instead of having it with a literal (/…/z
). The ugliness would make it less likely that Rubocop would try to force it on everyone.
Updated by ko1 (Koichi Sasada) 6 months ago · Edited
I found an idea that each thread points to unescaped MatchData rather than $~
and reuse it.
In other words, all generated MatchData will be cached by th->last_matchdata
(or similar) and use it across scopes.
def foo
if /.../ =~ ... # generate MatchData1
bar()
end
end
def bar
m = $str.match(//) # reuse MatchData1
end
It is thread-safe and increases an opportunity to reuse.
Updated by byroot (Jean Boussier) 6 months ago
I don't follow, how can it be re-used in your example? :
def foo
if /foo/ =~ "foo" # generate MatchData1
bar()
p $~ # #<MatchData "all">
end
end
def bar
m = "all".match(/all/) # reuse MatchData1
p $~ # #<MatchData "foo">
end
foo
Updated by byroot (Jean Boussier) 6 months ago
Interestingly the NO_MATCH regexp options was suggested in the ticket that led to Regexp#match?
[Feature #8110]
Updated by Eregon (Benoit Daloze) 6 months ago
ko1 (Koichi Sasada) wrote in #note-8:
what does it happen when creating a thread just after storing
$~
?
It's simply not set in the new thread, and that seems already the same behavior on CRuby:
$ ruby -ve '"a" =~ /a/; p $~; Thread.new { p $~ }.join'
ruby 3.3.1 (2024-04-23 revision c56cd86388) [x86_64-linux]
#<MatchData "a">
nil
$ ruby -ve '"a" =~ /a/; p $~; Thread.new { p $~ }.join'
truffleruby 24.1.0-dev-c2e5209c, like ruby 3.2.2, GraalVM CE Native [x86_64-linux]
#<MatchData "a">
nil
ko1 (Koichi Sasada) wrote in #note-11:
I found an idea that each thread points to unescaped MatchData rather than
$~
and reuse it.
I think that's too incompatible because $~
is frame-local and thread-local, so we need multiple $~
per thread, as @byroot (Jean Boussier) showed.
Updated by Eregon (Benoit Daloze) 6 months ago · Edited
byroot (Jean Boussier) wrote in #note-9:
I see what you mean, but such flag would only really be worth using in places where saving that allocation is worth it, where right now you usually use a literal anyway, so I don't think duplication would be a concern.
I'm thinking cases of Regexps being stored in constants and potentially composed of other regexps/strings, like https://github.com/ruby/uri/blob/master/lib/uri/rfc3986_parser.rb does it for example.
It seems bad to duplicate some of these regexps if (for the same large Regexp) we have some call sites which need the MatchData and some which don't.
Also Regexp#match
(which returns a MatchData) would make no sense with that flag, so it feels the wrong place to specify it.
Regarding gsub
/sub
specifically, I think it shouldn't set $~
, i.e. only set it if a block is passed.
For example with "abc".gsub("a", "d")
I don't see any point to set $~
after it.
That's potential incompatible, but we could warn for a release or so and I'd think few gems rely on that.
Updated by ko1 (Koichi Sasada) 6 months ago
Eregon (Benoit Daloze) wrote in #note-14:
ko1 (Koichi Sasada) wrote in #note-11:
I found an idea that each thread points to unescaped MatchData rather than
$~
and reuse it.I think that's too incompatible because
$~
is frame-local and thread-local, so we need multiple$~
per thread, as @byroot (Jean Boussier) showed.
No. It is not user visible behavior so no incompatiblity.
(don't change $~
)
Updated by byroot (Jean Boussier) 6 months ago
I'm thinking cases of Regexps being stored in constants and potentially composed of other regexps/strings, like https://github.com/ruby/uri/blob/master/lib/uri/rfc3986_parser.rb does it for example.
Sure, there are cases where it wouldn't be convenient. But the thing is, adding this extra flag would only really make a difference in hotspots so I don't mind too much if there are some cases where it's not super convenient.
So I don't think it's a good argument against.
Also Regexp#match (which returns a MatchData) would make no sense with that flag, so it feels the wrong place to specify it.
With the name I suggested, maybe, but with the proper name it would be fine.
Regarding gsub/sub specifically, I think it shouldn't set $~, i.e. only set it if a block is passed.
That has backward compatibility concerns, unlikely to be accepted, and even if it was, the deprecation period would be annoying for not so much gain.
Updated by Dan0042 (Daniel DeLorme) 6 months ago
byroot (Jean Boussier) wrote in #note-4:
Maybe we could add a new Regexp flag to turn off this behavior?
My first reaction was "Yes! This is exactly was we need!" but after thinking more it feels un-rubyish. We shouldn't have to write code to micro-tweak the performance like that. Ideally the interpreter/jit should handle micro-optimizations. I'd be happier to see something like:
def foo(str)
str if str =~ /rx/
#in this method we don't use $~ and friends, so the interpreter doesn't have to allocate MatchData
#yes in theory there's an incompatibility with eval, but in practice I believe that's a non-issue (I'm open to be shown otherwise)
end
shyouhei (Shyouhei Urabe) wrote in #note-5:
Nobody thinks
gsub
modifies$~
behind the scene.
I'm not sure this is what you meant, but I definitely expect ANY regexp operation to modify $~
(except where documented otherwise, like Regexp#match?
)
BTW this includes String#sub; I sometimes write code like: name, prefix = str.sub(/^(the) /i), $1
Updated by Dan0042 (Daniel DeLorme) 6 months ago · Edited
After reading over https://github.com/ruby/ruby/pull/4734/files it seems there's two parts to it.
- use a
set_match
pointer to return the match (this fixes the race condition) - always allocate a MatchData, never using
rb_backref_get()
But it seems to me that #2 is only necessary if set_match
is used. So what about using rb_backref_get()
when possible? Like
match = set_match ? Qnil : rb_backref_get();
Updated by jeremyevans0 (Jeremy Evans) 6 months ago
Dan0042 (Daniel DeLorme) wrote in #note-19:
After reading over https://github.com/ruby/ruby/pull/4734/files it seems there's two parts to it.
- use a
set_match
pointer to return the match (this fixes the race condition)- always allocate a MatchData, never using
rb_backref_get()
But it seems to me that #2 is only necessary if
set_match
is used. So what about usingrb_backref_get()
when possible? Likematch = set_match ? Qnil : rb_backref_get();
I'm not sure it is thread-safe. This would modify a shared backref in code paths where set_match
is NULL
. I haven't audited the related code, so I'm not sure what code calls rb_reg_search0
and rb_reg_match
. Feel free to give it a try and see if passes the test and reduces allocations.
Updated by byroot (Jean Boussier) 6 months ago
I'd be happier to see something like:
@Eregon (Benoit Daloze) given how good TruffleRuby is at escape analysis and such, before I dive into why it wasn't done before, is TruffleRuby able to not create the MatchData when it's not accessed, or is there something in the semantic that make it impossible to predict?
Updated by ko1 (Koichi Sasada) 6 months ago
ko1 (Koichi Sasada) wrote in #note-16:
No. It is not user visible behavior so no incompatiblity.
Sorry my wrong. Please ignore about it.
Updated by byroot (Jean Boussier) 6 months ago
is there something in the semantic that make it impossible to predict?
Answering to myself:
def match
"foo" =~ /f(o)o/
eval("$1")
end
p match
Updated by Eregon (Benoit Daloze) 6 months ago
@byroot (Jean Boussier) It depends in which situation but generally yes it's able to avoid the allocation.
If there is no block around, partial escape analysis avoids the allocation of the Ruby MatchData object as long as it's not leaked/stored globally (like $m = $~
), even if it is accessed in that method.
There might be still be an allocation of the internal data structure representing the group offsets, if the Regexp sometimes matches and sometimes not (but tail duplication can fix this in some cases, e.g. if there is not too much code to duplicate).
If there is 1/multiple block(s) around the regexp match, then $~
is stored in the method's frame and not the block's frame and then it's allocated unless there is a compilation covering the method and inlining everything until that block.
The case I checked is:
def foo
"a" =~ /(a)/
$1
end
loop { foo() }
with cd truffleruby && chruby truffleruby+graalvm-24.0.2 && jt -u ruby graph test.rb
and that shows the only allocation is for a new Ruby String (the return value).
Updated by byroot (Jean Boussier) 6 months ago
Right, so it's not as simple as marking the ISeq as not needing the backref because it doesn't use getspecial
.
I think we could only realistically do it in MRI if we accepted that $~
and such wouldn't be accessible from eval
. Could be worth asking at the developer meeting, but I'd be surprised if it was accepted.
Updated by Dan0042 (Daniel DeLorme) 6 months ago
I'd be surprised if it was accepted.
Same here. Although perhaps I should clarify that $~
would be accessible in eval if also present as a literal in the method.
Updated by hsbt (Hiroshi SHIBATA) 4 months ago
- Status changed from Open to Assigned