Misc #20652: Memory allocation for gsub has increased from Ruby 2.7 to 3.3 - Ruby - Ruby Issue Tracking System

Actions

Copy link

Misc #20652

closed

Memory allocation for gsub has increased from Ruby 2.7 to 3.3

Added by orisano (Nao Yonashiro) 12 months ago. Updated 5 months ago.

Status:

Closed

Assignee:

jeremyevans0 (Jeremy Evans)

[ruby-core:118682]

Description

I recently upgraded from ruby 2.7.7 to 3.3.1 and noticed that the GC load increased.
When I used the allocation profiler to investigate, I found that memory allocation from gsub had increased.

The problem was code like this:

s = "foo              "
s.gsub(/ (\s+)/) { " #{'&nbsp;' * Regexp.last_match(1).length}" }

When I compared the results of heap-profiler between 2.7.7 and 3.3.1, I found that MatchData was increasing.

https://gist.github.com/orisano/98792dee260106e9b6fcb45bbabeb1e6

https://github.com/ruby/ruby/commit/abc0304cb28cb9dcc3476993bc487884c139fd11

I discovered that the cause is this commit, which stopped reusing backref to avoid race conditions.
Is there a way to reuse backref while still avoiding race conditions?

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by mame (Yusuke Endoh) 12 months ago

Related to Bug #17507: Regexp capture groups ignored sometimes in some multithreaded environments (possible race condition) added

Actions

Copy link

#2 [ruby-core:118683]

Updated by mame (Yusuke Endoh) 12 months ago

Assignee set to jeremyevans0 (Jeremy Evans)

@jeremyevans0 (Jeremy Evans) What do you think?

Actions

Copy link

#3 [ruby-core:118685]

Updated by jeremyevans0 (Jeremy Evans) 12 months ago

It's expected that fixing #17507 caused memory usage to increase. If anyone can come up with an approach that fixes #17507 without causing an increase in memory usage, please submit a pull request for it.

Actions

Copy link

#4 [ruby-core:118689]

Updated by byroot (Jean Boussier) 12 months ago

If anyone can come up with an approach that fixes #17507 without causing an increase in memory usage

I guess I'm a bit surprised by that because I wouldn't have thought the backref is only accessible by the same fiber, but maybe I'm missing something.

But on a more general note, I very very often wish I'd have a way to not have a MatchData created by methods that take a Regexp, it's very rare you need the MatchData created by gsub or Regexp===. Most of the time it's fine, but when you are trying to optimize a hotspot, it's really something you'd want to be able to skip. It's in big part why Regexp#match? was added and popularized, but is only usable in a subset of cases.

Maybe we could add a new Regexp flag to turn off this behavior?

e.g.

case str
when /^a/
  p $~ # => #<MatchData "a">
when /^b/c
  p $~ # => nil
end

Actions

Copy link

#5 [ruby-core:118691]

Updated by shyouhei (Shyouhei Urabe) 12 months ago

I agree this is counter-intuitive. The #17507 problem was that

i = lambda { ...(touches $~)... }

many.times { Thread.start { many.times { i.call } } }

would break. This is because $~ is i's local variable. i is shared across threads so is $~.

@jeremyevans0 (Jeremy Evans) fixed this so that no two $~s are identical, no matter multi threaded situation or not.

The fix of course increases memory usage. But because the root cause of the problem is sharing local variables across threads, to unshare them the bloat seems kind of inevitable to me. I also think this is a needed fix. Nobody thinks gsub modifies $~ behind the scene.

I would agree with @byroot (Jean Boussier) that it could be better if we had a gsub variant that doesn't touch $~.

Actions

Copy link

#6 [ruby-core:118692]

Updated by byroot (Jean Boussier) 12 months ago

it could be better if we had a gsub variant that doesn't touch $~.

Right, but the problem is beyond gsub, e.g.:

>> "abba"[/(bb|[^b]{2})/]
=> "bb"
>> $~
=> #<MatchData "bb" 1:"bb">

So introducing gsub( backref: false) or gsub_no_backref would solve one case but would leave many others.

Hence why I think the only way to handle this general problem without introducing tons of new methods and lots of code churn would be a new Regexp flag.

Actions

Copy link

#7 [ruby-core:118693]

Updated by Eregon (Benoit Daloze) 12 months ago · Edited

FWIW, what TruffleRuby does for this is to store $~ as a frame-local thread-local variable, but thread-local only if more than 1 thread has been seen, otherwise it's stored directly in the frame:
https://github.com/oracle/truffleruby/blob/3cd422433deebe3fa664f8c4540811c42ca02e93/src/main/java/org/truffleruby/language/threadlocal/ThreadAndFrameLocalStorage.java

I'm not sure how it works on CRuby, but if $~ is stored directly in the frame then threads might see a different $~ than they expect which could lead to very subtle bugs.

I don't really like a Regexp flag for this because a Regexp might be used in different contexts and some usages might want $~ and some might not (which could lead to a bunch of duplication).

I think in general a good fix to simplify this and avoid this kind of races would be to store $~ in the caller frame (even if that's a block's frame) but not higher.
In this case it would be stored in the lambda's frame and not outside.
That's also quite a bit faster.
Of course it would be somewhat incompatible, but how much code uses a $~ outside a block when the Regexp call is made inside a block?
We could warn that such code should not rely on that for a release or so, before changing it.

Actions

Copy link

#8 [ruby-core:118694]

Updated by ko1 (Koichi Sasada) 12 months ago

Eregon (Benoit Daloze) wrote in #note-7:

FWIW, what TruffleRuby does for this is to store $~ as a frame-local thread-local variable, but thread-local only if more than 1 thread has been seen, otherwise it's stored directly in the frame:

off-topic:

what does it happen when creating a thread just after storing $~?

Actions

Copy link

#9 [ruby-core:118695]

Updated by byroot (Jean Boussier) 12 months ago

I don't really like a Regexp flag for this because a Regexp might be used in different contexts and some usages might want $~ and some might not (which could lead to a bunch of duplication).

I see what you mean, but such flag would only really be worth using in places where saving that allocation is worth it, where right now you usually use a literal anyway, so I don't think duplication would be a concern.

Of course you could expect Rubocop and such to try to force it on everyone because it's faster™, but...

Actions

Copy link

#10 [ruby-core:118697]

Updated by austin (Austin Ziegler) 12 months ago

byroot (Jean Boussier) wrote in #note-9:

I don't really like a Regexp flag for this because a Regexp might be used in different contexts and some usages might want $~ and some might not (which could lead to a bunch of duplication).

I see what you mean, but such flag would only really be worth using in places where saving that allocation is worth it, where right now you usually use a literal anyway, so I don't think duplication would be a concern.

Of course you could expect Rubocop and such to try to force it on everyone because it's faster™, but...

It could be a flag only accessible through Regexp.new (Regexp.new(source, Regexp::OPTIMIZED_MATCHDATA)) instead of having it with a literal (/…/z). The ugliness would make it less likely that Rubocop would try to force it on everyone.

Actions

Copy link

#11 [ruby-core:118702]

Updated by ko1 (Koichi Sasada) 12 months ago · Edited

I found an idea that each thread points to unescaped MatchData rather than $~ and reuse it.
In other words, all generated MatchData will be cached by th->last_matchdata (or similar) and use it across scopes.

def foo
  if /.../ =~ ... # generate MatchData1
    bar()
  end
end

def bar
  m = $str.match(//) # reuse MatchData1
end

It is thread-safe and increases an opportunity to reuse.

Actions

Copy link

#12 [ruby-core:118707]

Updated by byroot (Jean Boussier) 12 months ago

I don't follow, how can it be re-used in your example? :

def foo
  if /foo/ =~ "foo" # generate MatchData1
    bar()
    p $~ # #<MatchData "all">
  end
end

def bar
  m = "all".match(/all/) # reuse MatchData1
  p $~ # #<MatchData "foo">
end

foo

Actions

Copy link

#13 [ruby-core:118708]

Updated by byroot (Jean Boussier) 12 months ago

Interestingly the NO_MATCH regexp options was suggested in the ticket that led to Regexp#match? [Feature #8110]

Actions

Copy link

#14 [ruby-core:118709]

Updated by Eregon (Benoit Daloze) 12 months ago

ko1 (Koichi Sasada) wrote in #note-8:

what does it happen when creating a thread just after storing $~?

It's simply not set in the new thread, and that seems already the same behavior on CRuby:

$ ruby -ve '"a" =~ /a/; p $~; Thread.new { p $~ }.join'
ruby 3.3.1 (2024-04-23 revision c56cd86388) [x86_64-linux]
#<MatchData "a">
nil
$ ruby -ve '"a" =~ /a/; p $~; Thread.new { p $~ }.join'
truffleruby 24.1.0-dev-c2e5209c, like ruby 3.2.2, GraalVM CE Native [x86_64-linux]
#<MatchData "a">
nil

ko1 (Koichi Sasada) wrote in #note-11:

I found an idea that each thread points to unescaped MatchData rather than $~ and reuse it.

I think that's too incompatible because $~ is frame-local and thread-local, so we need multiple $~ per thread, as @byroot (Jean Boussier) showed.

Actions

Copy link

#15 [ruby-core:118710]

Updated by Eregon (Benoit Daloze) 12 months ago · Edited

byroot (Jean Boussier) wrote in #note-9:

I see what you mean, but such flag would only really be worth using in places where saving that allocation is worth it, where right now you usually use a literal anyway, so I don't think duplication would be a concern.

I'm thinking cases of Regexps being stored in constants and potentially composed of other regexps/strings, like https://github.com/ruby/uri/blob/master/lib/uri/rfc3986_parser.rb does it for example.
It seems bad to duplicate some of these regexps if (for the same large Regexp) we have some call sites which need the MatchData and some which don't.

Also Regexp#match (which returns a MatchData) would make no sense with that flag, so it feels the wrong place to specify it.

Regarding gsub/sub specifically, I think it shouldn't set $~, i.e. only set it if a block is passed.
For example with "abc".gsub("a", "d") I don't see any point to set $~ after it.
That's potential incompatible, but we could warn for a release or so and I'd think few gems rely on that.

Actions

Copy link

#16 [ruby-core:118711]

Updated by ko1 (Koichi Sasada) 12 months ago

Eregon (Benoit Daloze) wrote in #note-14:

ko1 (Koichi Sasada) wrote in #note-11:

I found an idea that each thread points to unescaped MatchData rather than $~ and reuse it.

I think that's too incompatible because $~ is frame-local and thread-local, so we need multiple $~ per thread, as @byroot (Jean Boussier) showed.

No. It is not user visible behavior so no incompatiblity.
(don't change $~)

Actions

Copy link

#17 [ruby-core:118716]

Updated by byroot (Jean Boussier) 12 months ago

I'm thinking cases of Regexps being stored in constants and potentially composed of other regexps/strings, like https://github.com/ruby/uri/blob/master/lib/uri/rfc3986_parser.rb does it for example.

Sure, there are cases where it wouldn't be convenient. But the thing is, adding this extra flag would only really make a difference in hotspots so I don't mind too much if there are some cases where it's not super convenient.

So I don't think it's a good argument against.

Also Regexp#match (which returns a MatchData) would make no sense with that flag, so it feels the wrong place to specify it.

With the name I suggested, maybe, but with the proper name it would be fine.

Regarding gsub/sub specifically, I think it shouldn't set $~, i.e. only set it if a block is passed.

That has backward compatibility concerns, unlikely to be accepted, and even if it was, the deprecation period would be annoying for not so much gain.

Actions

Copy link

#18 [ruby-core:118720]

Updated by Dan0042 (Daniel DeLorme) 12 months ago

byroot (Jean Boussier) wrote in #note-4:

Maybe we could add a new Regexp flag to turn off this behavior?

My first reaction was "Yes! This is exactly was we need!" but after thinking more it feels un-rubyish. We shouldn't have to write code to micro-tweak the performance like that. Ideally the interpreter/jit should handle micro-optimizations. I'd be happier to see something like:

def foo(str)
  str if str =~ /rx/
  #in this method we don't use $~ and friends, so the interpreter doesn't have to allocate MatchData
  #yes in theory there's an incompatibility with eval, but in practice I believe that's a non-issue (I'm open to be shown otherwise)
end

shyouhei (Shyouhei Urabe) wrote in #note-5:

Nobody thinks gsub modifies $~ behind the scene.

I'm not sure this is what you meant, but I definitely expect ANY regexp operation to modify $~ (except where documented otherwise, like Regexp#match?)
BTW this includes String#sub; I sometimes write code like: name, prefix = str.sub(/^(the) /i), $1

Actions

Copy link

#19 [ruby-core:118722]

Updated by Dan0042 (Daniel DeLorme) 12 months ago · Edited

After reading over https://github.com/ruby/ruby/pull/4734/files it seems there's two parts to it.

use a set_match pointer to return the match (this fixes the race condition)
always allocate a MatchData, never using rb_backref_get()

But it seems to me that #2 is only necessary if set_match is used. So what about using rb_backref_get() when possible? Like

match = set_match ? Qnil : rb_backref_get();

@jeremyevans0 (Jeremy Evans) wdyt?

Actions

Copy link

#20 [ruby-core:118723]

Updated by jeremyevans0 (Jeremy Evans) 12 months ago

Dan0042 (Daniel DeLorme) wrote in #note-19:

After reading over https://github.com/ruby/ruby/pull/4734/files it seems there's two parts to it.

use a set_match pointer to return the match (this fixes the race condition)

always allocate a MatchData, never using rb_backref_get()

But it seems to me that #2 is only necessary if set_match is used. So what about using rb_backref_get() when possible? Like
match = set_match ? Qnil : rb_backref_get();
@jeremyevans0 (Jeremy Evans) wdyt?

I'm not sure it is thread-safe. This would modify a shared backref in code paths where set_match is NULL. I haven't audited the related code, so I'm not sure what code calls rb_reg_search0 and rb_reg_match. Feel free to give it a try and see if passes the test and reduces allocations.

Actions

Copy link

#21 [ruby-core:118724]

Updated by byroot (Jean Boussier) 12 months ago

I'd be happier to see something like:

@Eregon (Benoit Daloze) given how good TruffleRuby is at escape analysis and such, before I dive into why it wasn't done before, is TruffleRuby able to not create the MatchData when it's not accessed, or is there something in the semantic that make it impossible to predict?

Actions

Copy link

#22 [ruby-core:118728]

Updated by ko1 (Koichi Sasada) 12 months ago

ko1 (Koichi Sasada) wrote in #note-16:

No. It is not user visible behavior so no incompatiblity.

Sorry my wrong. Please ignore about it.

Actions

Copy link

#23 [ruby-core:118729]

Updated by byroot (Jean Boussier) 12 months ago

is there something in the semantic that make it impossible to predict?

Answering to myself:

def match
  "foo" =~ /f(o)o/
  eval("$1")
end

p match

Actions

Copy link

#24 [ruby-core:118736]

Updated by Eregon (Benoit Daloze) 12 months ago

@byroot (Jean Boussier) It depends in which situation but generally yes it's able to avoid the allocation.
If there is no block around, partial escape analysis avoids the allocation of the Ruby MatchData object as long as it's not leaked/stored globally (like $m = $~), even if it is accessed in that method.
There might be still be an allocation of the internal data structure representing the group offsets, if the Regexp sometimes matches and sometimes not (but tail duplication can fix this in some cases, e.g. if there is not too much code to duplicate).

If there is 1/multiple block(s) around the regexp match, then $~ is stored in the method's frame and not the block's frame and then it's allocated unless there is a compilation covering the method and inlining everything until that block.

The case I checked is:

def foo
  "a" =~ /(a)/
  $1
end

loop { foo() }

with cd truffleruby && chruby truffleruby+graalvm-24.0.2 && jt -u ruby graph test.rb and that shows the only allocation is for a new Ruby String (the return value).

Actions

Copy link

#25 [ruby-core:118748]

Updated by byroot (Jean Boussier) 12 months ago

Right, so it's not as simple as marking the ISeq as not needing the backref because it doesn't use getspecial.

I think we could only realistically do it in MRI if we accepted that $~ and such wouldn't be accessible from eval. Could be worth asking at the developer meeting, but I'd be surprised if it was accepted.

Actions

Copy link

#26 [ruby-core:118749]

Updated by Dan0042 (Daniel DeLorme) 12 months ago

I'd be surprised if it was accepted.

Same here. Although perhaps I should clarify that $~ would be accessible in eval if also present as a literal in the method.

Actions

Copy link

#27

Updated by hsbt (Hiroshi SHIBATA) 10 months ago

Status changed from Open to Assigned

Actions

Copy link

#28

Updated by byroot (Jean Boussier) 5 months ago

Status changed from Assigned to Closed

Applied in changeset git|87f9c3c65e38fa3e5c6ef097e2cf63ff448f48d6.

String#gsub! Elide MatchData allocation when we know it can't escape

In gsub is used with a string replacement or a map that doesn't
have a default proc, we know for sure no code can cause the MatchData
to escape the gsub call.

In such case, we still have to allocate a new MatchData because we
don't know what is the lifetime of the backref, but for any subsequent
match we can re-use the MatchData we allocated ourselves, reducing
allocations significantly.

This partially fixes [Misc #20652], except when a block is used,
and partially reduce the performance impact of
abc0304cb28cb9dcc3476993bc487884c139fd11 / [Bug #17507]

compare-ruby: ruby 3.5.0dev (2025-02-24T09:44:57Z master 5cf146399f) +PRISM [arm64-darwin24]
built-ruby: ruby 3.5.0dev (2025-02-24T10:58:27Z gsub-elude-match da966636e9) +PRISM [arm64-darwin24]
warming up....

|                 |compare-ruby|built-ruby|
|:----------------|-----------:|---------:|
|escape           |      3.577k|    3.697k|
|                 |           -|     1.03x|
|escape_bin       |      5.869k|    6.743k|
|                 |           -|     1.15x|
|escape_utf8      |      3.448k|    3.738k|
|                 |           -|     1.08x|
|escape_utf8_bin  |      6.361k|    7.267k|
|                 |           -|     1.14x|

Co-Authored-By: Étienne Barrié etienne.barrie@gmail.com

Actions

Copy link

#29 [ruby-core:121152]

Updated by byroot (Jean Boussier) 5 months ago

The commit says the issue is partially fixed, but I made a followup that elide the MatchData allocation in almost every cases. So I think this issue can be considered fully fixed.

Actions

Copy link

#30 [ruby-core:121153]

Updated by byroot (Jean Boussier) 5 months ago

Sorry, I wasn't clear. The allocation is eluded if the previous MatchData wasn't accessed. If it was, I'm afraid there's nothing we can safely do.

Actions

Copy link

Also available in: Atom PDF

Like0

Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0

Project

General

Profile

Ruby

Tags

Custom queries

Misc #20652

Memory allocation for gsub has increased from Ruby 2.7 to 3.3

Updated by mame (Yusuke Endoh) 12 months ago

Updated by mame (Yusuke Endoh) 12 months ago

Updated by jeremyevans0 (Jeremy Evans) 12 months ago

Updated by byroot (Jean Boussier) 12 months ago

Updated by shyouhei (Shyouhei Urabe) 12 months ago

Updated by byroot (Jean Boussier) 12 months ago

Updated by Eregon (Benoit Daloze) 12 months ago · Edited

Updated by ko1 (Koichi Sasada) 12 months ago

Updated by byroot (Jean Boussier) 12 months ago

Updated by austin (Austin Ziegler) 12 months ago

Updated by ko1 (Koichi Sasada) 12 months ago · Edited

Updated by byroot (Jean Boussier) 12 months ago

Updated by byroot (Jean Boussier) 12 months ago

Updated by Eregon (Benoit Daloze) 12 months ago

Updated by Eregon (Benoit Daloze) 12 months ago · Edited

Updated by ko1 (Koichi Sasada) 12 months ago

Updated by byroot (Jean Boussier) 12 months ago

Updated by Dan0042 (Daniel DeLorme) 12 months ago

Updated by Dan0042 (Daniel DeLorme) 12 months ago · Edited

Updated by jeremyevans0 (Jeremy Evans) 12 months ago

Updated by byroot (Jean Boussier) 12 months ago

Updated by ko1 (Koichi Sasada) 12 months ago

Updated by byroot (Jean Boussier) 12 months ago

Updated by Eregon (Benoit Daloze) 12 months ago

Updated by byroot (Jean Boussier) 12 months ago

Updated by Dan0042 (Daniel DeLorme) 12 months ago

Updated by hsbt (Hiroshi SHIBATA) 10 months ago

Updated by byroot (Jean Boussier) 5 months ago

Updated by byroot (Jean Boussier) 5 months ago

Updated by byroot (Jean Boussier) 5 months ago