Misc #20652
openMemory allocation for gsub has increased from Ruby 2.7 to 3.3
Added by orisano (Nao Yonashiro) 2 days ago. Updated about 5 hours ago.
Description
I recently upgraded from ruby 2.7.7 to 3.3.1 and noticed that the GC load increased.
When I used the allocation profiler to investigate, I found that memory allocation from gsub had increased.
The problem was code like this:
s = "foo "
s.gsub(/ (\s+)/) { " #{' ' * Regexp.last_match(1).length}" }
When I compared the results of heap-profiler between 2.7.7 and 3.3.1, I found that MatchData was increasing.
https://gist.github.com/orisano/98792dee260106e9b6fcb45bbabeb1e6
https://github.com/ruby/ruby/commit/abc0304cb28cb9dcc3476993bc487884c139fd11
I discovered that the cause is this commit, which stopped reusing backref to avoid race conditions.
Is there a way to reuse backref while still avoiding race conditions?
Updated by mame (Yusuke Endoh) 2 days ago
- Related to Bug #17507: Regexp capture groups ignored sometimes in some multithreaded environments (possible race condition) added
Updated by mame (Yusuke Endoh) 2 days ago
- Assignee set to jeremyevans0 (Jeremy Evans)
@jeremyevans0 (Jeremy Evans) What do you think?
Updated by jeremyevans0 (Jeremy Evans) 2 days ago
Updated by byroot (Jean Boussier) 1 day ago
If anyone can come up with an approach that fixes #17507 without causing an increase in memory usage
I guess I'm a bit surprised by that because I wouldn't have thought the backref is only accessible by the same fiber, but maybe I'm missing something.
But on a more general note, I very very often wish I'd have a way to not have a MatchData
created by methods that take a Regexp, it's very rare you need the MatchData
created by gsub
or Regexp===
. Most of the time it's fine, but when you are trying to optimize a hotspot, it's really something you'd want to be able to skip. It's in big part why Regexp#match?
was added and popularized, but is only usable in a subset of cases.
Maybe we could add a new Regexp flag to turn off this behavior?
e.g.
case str
when /^a/
p $~ # => #<MatchData "a">
when /^b/c
p $~ # => nil
end
Updated by shyouhei (Shyouhei Urabe) 1 day ago
I agree this is counter-intuitive. The #17507 problem was that
i = lambda { ...(touches $~)... }
many.times { Thread.start { many.times { i.call } } }
would break. This is because $~
is i
's local variable. i
is shared across threads so is $~
.
@jeremyevans0 (Jeremy Evans) fixed this so that no two $~
s are identical, no matter multi threaded situation or not.
The fix of course increases memory usage. But because the root cause of the problem is sharing local variables across threads, to unshare them the bloat seems kind of inevitable to me. I also think this is a needed fix. Nobody thinks gsub
modifies $~
behind the scene.
I would agree with @byroot (Jean Boussier) that it could be better if we had a gsub
variant that doesn't touch $~
.
Updated by byroot (Jean Boussier) 1 day ago
it could be better if we had a gsub variant that doesn't touch $~.
Right, but the problem is beyond gsub
, e.g.:
>> "abba"[/(bb|[^b]{2})/]
=> "bb"
>> $~
=> #<MatchData "bb" 1:"bb">
So introducing gsub( backref: false)
or gsub_no_backref
would solve one case but would leave many others.
Hence why I think the only way to handle this general problem without introducing tons of new methods and lots of code churn would be a new Regexp flag.
Updated by Eregon (Benoit Daloze) 1 day ago · Edited
FWIW, what TruffleRuby does for this is to store $~
as a frame-local thread-local variable, but thread-local only if more than 1 thread has been seen, otherwise it's stored directly in the frame:
https://github.com/oracle/truffleruby/blob/3cd422433deebe3fa664f8c4540811c42ca02e93/src/main/java/org/truffleruby/language/threadlocal/ThreadAndFrameLocalStorage.java
I'm not sure how it works on CRuby, but if $~
is stored directly in the frame then threads might see a different $~
than they expect which could lead to very subtle bugs.
I don't really like a Regexp flag for this because a Regexp might be used in different contexts and some usages might want $~
and some might not (which could lead to a bunch of duplication).
I think in general a good fix to simplify this and avoid this kind of races would be to store $~
in the caller frame (even if that's a block's frame) but not higher.
In this case it would be stored in the lambda
's frame and not outside.
That's also quite a bit faster.
Of course it would be somewhat incompatible, but how much code uses a $~
outside a block when the Regexp call is made inside a block?
We could warn that such code should not rely on that for a release or so, before changing it.
Updated by ko1 (Koichi Sasada) 1 day ago
Eregon (Benoit Daloze) wrote in #note-7:
FWIW, what TruffleRuby does for this is to store
$~
as a frame-local thread-local variable, but thread-local only if more than 1 thread has been seen, otherwise it's stored directly in the frame:
off-topic:
what does it happen when creating a thread just after storing $~
?
Updated by byroot (Jean Boussier) 1 day ago
I don't really like a Regexp flag for this because a Regexp might be used in different contexts and some usages might want $~ and some might not (which could lead to a bunch of duplication).
I see what you mean, but such flag would only really be worth using in places where saving that allocation is worth it, where right now you usually use a literal anyway, so I don't think duplication would be a concern.
Of course you could expect Rubocop and such to try to force it on everyone because it's faster™, but...
Updated by austin (Austin Ziegler) 1 day ago
byroot (Jean Boussier) wrote in #note-9:
I don't really like a Regexp flag for this because a Regexp might be used in different contexts and some usages might want $~ and some might not (which could lead to a bunch of duplication).
I see what you mean, but such flag would only really be worth using in places where saving that allocation is worth it, where right now you usually use a literal anyway, so I don't think duplication would be a concern.
Of course you could expect Rubocop and such to try to force it on everyone because it's faster™, but...
It could be a flag only accessible through Regexp.new
(Regexp.new(source, Regexp::OPTIMIZED_MATCHDATA)
) instead of having it with a literal (/…/z
). The ugliness would make it less likely that Rubocop would try to force it on everyone.
Updated by ko1 (Koichi Sasada) about 23 hours ago · Edited
I found an idea that each thread points to unescaped MatchData rather than $~
and reuse it.
In other words, all generated MatchData will be cached by th->last_matchdata
(or similar) and use it across scopes.
def foo
if /.../ =~ ... # generate MatchData1
bar()
end
end
def bar
m = $str.match(//) # reuse MatchData1
end
It is thread-safe and increases an opportunity to reuse.
Updated by byroot (Jean Boussier) about 9 hours ago
I don't follow, how can it be re-used in your example? :
def foo
if /foo/ =~ "foo" # generate MatchData1
bar()
p $~ # #<MatchData "all">
end
end
def bar
m = "all".match(/all/) # reuse MatchData1
p $~ # #<MatchData "foo">
end
foo
Updated by byroot (Jean Boussier) about 9 hours ago
Interestingly the NO_MATCH regexp options was suggested in the ticket that led to Regexp#match?
[Feature #8110]
Updated by Eregon (Benoit Daloze) about 5 hours ago
ko1 (Koichi Sasada) wrote in #note-8:
what does it happen when creating a thread just after storing
$~
?
It's simply not set in the new thread, and that seems already the same behavior on CRuby:
$ ruby -ve '"a" =~ /a/; p $~; Thread.new { p $~ }.join'
ruby 3.3.1 (2024-04-23 revision c56cd86388) [x86_64-linux]
#<MatchData "a">
nil
$ ruby -ve '"a" =~ /a/; p $~; Thread.new { p $~ }.join'
truffleruby 24.1.0-dev-c2e5209c, like ruby 3.2.2, GraalVM CE Native [x86_64-linux]
#<MatchData "a">
nil
ko1 (Koichi Sasada) wrote in #note-11:
I found an idea that each thread points to unescaped MatchData rather than
$~
and reuse it.
I think that's too incompatible because $~
is frame-local and thread-local, so we need multiple $~
per thread, as @byroot (Jean Boussier) showed.
Updated by Eregon (Benoit Daloze) about 5 hours ago · Edited
byroot (Jean Boussier) wrote in #note-9:
I see what you mean, but such flag would only really be worth using in places where saving that allocation is worth it, where right now you usually use a literal anyway, so I don't think duplication would be a concern.
I'm thinking cases of Regexps being stored in constants and potentially composed of other regexps/strings, like https://github.com/ruby/uri/blob/master/lib/uri/rfc3986_parser.rb does it for example.
It seems bad to duplicate some of these regexps if (for the same large Regexp) we have some call sites which need the MatchData and some which don't.
Also Regexp#match
(which returns a MatchData) would make no sense with that flag, so it feels the wrong place to specify it.
Regarding gsub
/sub
specifically, I think it shouldn't set $~
, i.e. only set it if a block is passed.
For example with "abc".gsub("a", "d")
I don't see any point to set $~
after it.
That's potential incompatible, but we could warn for a release or so and I'd think few gems rely on that.