Project

General

Profile

Actions

Misc #20652

open

Memory allocation for gsub has increased from Ruby 2.7 to 3.3

Added by orisano (Nao Yonashiro) 2 days ago. Updated about 6 hours ago.


Description

I recently upgraded from ruby 2.7.7 to 3.3.1 and noticed that the GC load increased.
When I used the allocation profiler to investigate, I found that memory allocation from gsub had increased.

The problem was code like this:

s = "foo              "
s.gsub(/ (\s+)/) { " #{' ' * Regexp.last_match(1).length}" }

When I compared the results of heap-profiler between 2.7.7 and 3.3.1, I found that MatchData was increasing.

https://gist.github.com/orisano/98792dee260106e9b6fcb45bbabeb1e6

https://github.com/ruby/ruby/commit/abc0304cb28cb9dcc3476993bc487884c139fd11

I discovered that the cause is this commit, which stopped reusing backref to avoid race conditions.
Is there a way to reuse backref while still avoiding race conditions?


Related issues 1 (0 open1 closed)

Related to Ruby master - Bug #17507: Regexp capture groups ignored sometimes in some multithreaded environments (possible race condition)ClosedActions
Actions #1

Updated by mame (Yusuke Endoh) 2 days ago

  • Related to Bug #17507: Regexp capture groups ignored sometimes in some multithreaded environments (possible race condition) added

Updated by mame (Yusuke Endoh) 2 days ago

  • Assignee set to jeremyevans0 (Jeremy Evans)

@jeremyevans0 (Jeremy Evans) What do you think?

Updated by jeremyevans0 (Jeremy Evans) 2 days ago

It's expected that fixing #17507 caused memory usage to increase. If anyone can come up with an approach that fixes #17507 without causing an increase in memory usage, please submit a pull request for it.

Updated by byroot (Jean Boussier) 1 day ago

If anyone can come up with an approach that fixes #17507 without causing an increase in memory usage

I guess I'm a bit surprised by that because I wouldn't have thought the backref is only accessible by the same fiber, but maybe I'm missing something.

But on a more general note, I very very often wish I'd have a way to not have a MatchData created by methods that take a Regexp, it's very rare you need the MatchData created by gsub or Regexp===. Most of the time it's fine, but when you are trying to optimize a hotspot, it's really something you'd want to be able to skip. It's in big part why Regexp#match? was added and popularized, but is only usable in a subset of cases.

Maybe we could add a new Regexp flag to turn off this behavior?

e.g.

case str
when /^a/
  p $~ # => #<MatchData "a">
when /^b/c
  p $~ # => nil
end

Updated by shyouhei (Shyouhei Urabe) 1 day ago

I agree this is counter-intuitive. The #17507 problem was that

i = lambda { ...(touches $~)... }

many.times { Thread.start { many.times { i.call } } }

would break. This is because $~ is i's local variable. i is shared across threads so is $~.

@jeremyevans0 (Jeremy Evans) fixed this so that no two $~s are identical, no matter multi threaded situation or not.

The fix of course increases memory usage. But because the root cause of the problem is sharing local variables across threads, to unshare them the bloat seems kind of inevitable to me. I also think this is a needed fix. Nobody thinks gsub modifies $~ behind the scene.

I would agree with @byroot (Jean Boussier) that it could be better if we had a gsub variant that doesn't touch $~.

Updated by byroot (Jean Boussier) 1 day ago

it could be better if we had a gsub variant that doesn't touch $~.

Right, but the problem is beyond gsub, e.g.:

>> "abba"[/(bb|[^b]{2})/]
=> "bb"
>> $~
=> #<MatchData "bb" 1:"bb">

So introducing gsub( backref: false) or gsub_no_backref would solve one case but would leave many others.

Hence why I think the only way to handle this general problem without introducing tons of new methods and lots of code churn would be a new Regexp flag.

Updated by Eregon (Benoit Daloze) 1 day ago · Edited

FWIW, what TruffleRuby does for this is to store $~ as a frame-local thread-local variable, but thread-local only if more than 1 thread has been seen, otherwise it's stored directly in the frame:
https://github.com/oracle/truffleruby/blob/3cd422433deebe3fa664f8c4540811c42ca02e93/src/main/java/org/truffleruby/language/threadlocal/ThreadAndFrameLocalStorage.java

I'm not sure how it works on CRuby, but if $~ is stored directly in the frame then threads might see a different $~ than they expect which could lead to very subtle bugs.

I don't really like a Regexp flag for this because a Regexp might be used in different contexts and some usages might want $~ and some might not (which could lead to a bunch of duplication).

I think in general a good fix to simplify this and avoid this kind of races would be to store $~ in the caller frame (even if that's a block's frame) but not higher.
In this case it would be stored in the lambda's frame and not outside.
That's also quite a bit faster.
Of course it would be somewhat incompatible, but how much code uses a $~ outside a block when the Regexp call is made inside a block?
We could warn that such code should not rely on that for a release or so, before changing it.

Updated by ko1 (Koichi Sasada) 1 day ago

Eregon (Benoit Daloze) wrote in #note-7:

FWIW, what TruffleRuby does for this is to store $~ as a frame-local thread-local variable, but thread-local only if more than 1 thread has been seen, otherwise it's stored directly in the frame:

off-topic:

what does it happen when creating a thread just after storing $~?

Updated by byroot (Jean Boussier) 1 day ago

I don't really like a Regexp flag for this because a Regexp might be used in different contexts and some usages might want $~ and some might not (which could lead to a bunch of duplication).

I see what you mean, but such flag would only really be worth using in places where saving that allocation is worth it, where right now you usually use a literal anyway, so I don't think duplication would be a concern.

Of course you could expect Rubocop and such to try to force it on everyone because it's faster™, but...

Updated by austin (Austin Ziegler) 1 day ago

byroot (Jean Boussier) wrote in #note-9:

I don't really like a Regexp flag for this because a Regexp might be used in different contexts and some usages might want $~ and some might not (which could lead to a bunch of duplication).

I see what you mean, but such flag would only really be worth using in places where saving that allocation is worth it, where right now you usually use a literal anyway, so I don't think duplication would be a concern.

Of course you could expect Rubocop and such to try to force it on everyone because it's faster™, but...

It could be a flag only accessible through Regexp.new (Regexp.new(source, Regexp::OPTIMIZED_MATCHDATA)) instead of having it with a literal (/…/z). The ugliness would make it less likely that Rubocop would try to force it on everyone.

Updated by ko1 (Koichi Sasada) about 24 hours ago · Edited

I found an idea that each thread points to unescaped MatchData rather than $~ and reuse it.
In other words, all generated MatchData will be cached by th->last_matchdata (or similar) and use it across scopes.

def foo
  if /.../ =~ ... # generate MatchData1
    bar()
  end
end

def bar
  m = $str.match(//) # reuse MatchData1
end

It is thread-safe and increases an opportunity to reuse.

Updated by byroot (Jean Boussier) about 10 hours ago

I don't follow, how can it be re-used in your example? :

def foo
  if /foo/ =~ "foo" # generate MatchData1
    bar()
    p $~ # #<MatchData "all">
  end
end

def bar
  m = "all".match(/all/) # reuse MatchData1
  p $~ # #<MatchData "foo">
end

foo

Updated by byroot (Jean Boussier) about 10 hours ago

Interestingly the NO_MATCH regexp options was suggested in the ticket that led to Regexp#match? [Feature #8110]

Updated by Eregon (Benoit Daloze) about 6 hours ago

ko1 (Koichi Sasada) wrote in #note-8:

what does it happen when creating a thread just after storing $~?

It's simply not set in the new thread, and that seems already the same behavior on CRuby:

$ ruby -ve '"a" =~ /a/; p $~; Thread.new { p $~ }.join'
ruby 3.3.1 (2024-04-23 revision c56cd86388) [x86_64-linux]
#<MatchData "a">
nil
$ ruby -ve '"a" =~ /a/; p $~; Thread.new { p $~ }.join'
truffleruby 24.1.0-dev-c2e5209c, like ruby 3.2.2, GraalVM CE Native [x86_64-linux]
#<MatchData "a">
nil

ko1 (Koichi Sasada) wrote in #note-11:

I found an idea that each thread points to unescaped MatchData rather than $~ and reuse it.

I think that's too incompatible because $~ is frame-local and thread-local, so we need multiple $~ per thread, as @byroot (Jean Boussier) showed.

Updated by Eregon (Benoit Daloze) about 6 hours ago · Edited

byroot (Jean Boussier) wrote in #note-9:

I see what you mean, but such flag would only really be worth using in places where saving that allocation is worth it, where right now you usually use a literal anyway, so I don't think duplication would be a concern.

I'm thinking cases of Regexps being stored in constants and potentially composed of other regexps/strings, like https://github.com/ruby/uri/blob/master/lib/uri/rfc3986_parser.rb does it for example.
It seems bad to duplicate some of these regexps if (for the same large Regexp) we have some call sites which need the MatchData and some which don't.

Also Regexp#match (which returns a MatchData) would make no sense with that flag, so it feels the wrong place to specify it.

Regarding gsub/sub specifically, I think it shouldn't set $~, i.e. only set it if a block is passed.
For example with "abc".gsub("a", "d") I don't see any point to set $~ after it.
That's potential incompatible, but we could warn for a release or so and I'd think few gems rely on that.

Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0