Bug #4040

SystemStackError with Hash[*a] for Large _a_

Added by runpaint (Run Paint Run Run) over 13 years ago. Updated 6 months ago.

Status:
Closed
Target version:
-
ruby -v:
ruby 1.9.3dev (2010-11-09 trunk 29737) [x86_64-linux]
[ruby-core:33132]
Description

=begin
I've been hesitating over whether to file a ticket about this, so please feel free to close if I've made the wrong choice.

I often use Hash[*array.flatten] in IRB to convert arrays of arrays into hashes. Today I noticed that if the array is big enough, this would raise a SystemStackError. Puzzled, I looked deeper. I assumed I was hitting the maximum number of arguments a method's argc can hold, but realised that the minimum size of the array needed to trigger this exception differed depending on whether I used IRB or not. So, presumably this is indeed exhausting the stack...
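
As a small, hypothetical illustration of the pattern in question:

pairs = [[:a, 1], [:b, 2], [:c, 3]]
Hash[*pairs.flatten]   # => {:a=>1, :b=>2, :c=>3}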

In IRB, the following is the minimal reproduction of this problem:

Hash[*130648.times.map{ 1 }]; true

I haven't looked for the minimum value needed with ruby -e, but the following reproduces:

ruby -e 'Hash[*1380888.times.map{ 1 }]'

I suppose this isn't technically a bug, but maybe it offers another argument for either #666 or an extension of #3131.
=end


Related issues (0 open, 2 closed)

Related to Ruby master - Bug #982: stack level too deep for long Array initialization (Closed, ko1 (Koichi Sasada), 12/29/2008)
Related to Ruby master - Bug #5719: Hash::[] can't handle 100000+ args (Closed, ko1 (Koichi Sasada))
#1

Updated by duerst (Martin Dürst) over 13 years ago

=begin
This bug may be related to bug #982.
=end

Updated by naruse (Yui NARUSE) over 12 years ago

  • Status changed from Open to Assigned
  • Assignee set to ko1 (Koichi Sasada)

Updated by nahi (Hiroshi Nakamura) over 12 years ago

  • Target version set to 1.9.3

Updated by ko1 (Koichi Sasada) over 12 years ago

  • Target version changed from 1.9.3 to 2.0.0

Let's postpone it to the next version.

Updated by ko1 (Koichi Sasada) over 11 years ago

  • Target version changed from 2.0.0 to 2.6

We need to reconsider method invocation to support such cases.
I want to tackle this in the next version.

Updated by ko1 (Koichi Sasada) about 7 years ago

With the current implementation, splatting huge argument lists (and receiving them as rest arguments) works fine for Ruby methods. However, C methods don't support this pattern. It should be fixed.
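
A rough, hypothetical sketch of the distinction (the 200_000 size is arbitrary; on versions without the fix, the cfunc call is the one that can raise SystemStackError):

def ruby_rest(*args); args.length; end   # Ruby-defined (iseq) method with a rest argument
big = Array.new(200_000, 1)

ruby_rest(*big)   # fine: large splats to Ruby methods no longer expand on the VM stack
Hash[*big]        # cfunc: without the fix, this pattern can exhaust the VM stack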

#7

Updated by ko1 (Koichi Sasada) about 7 years ago

  • Related to Bug #5719: Hash::[] can't handle 100000+ args added
#8

Updated by naruse (Yui NARUSE) about 6 years ago

  • Target version deleted (2.6)
#10

Updated by ko1 (Koichi Sasada) about 1 year ago

  • Status changed from Assigned to Closed

Applied in changeset git|2e7bceb34ea858649e1f975a934ce1894d1f06a6.


Do not use VM stack for splat arg on cfunc

On cfunc methods, if a splat argument is given, all array elements
are expanded on the VM stack, which can cause SystemStackError.
The idea to avoid this is to make a hidden array containing all
parameters and use that array as the argv.

This patch is a revised version of https://github.com/ruby/ruby/pull/6816.
The main change is that all modifications are confined to the cfunc-calling logic.

Fixes [Bug #4040]

Co-authored-by: Jeremy Evans

Updated by jeremyevans0 (Jeremy Evans) about 1 year ago

  • Status changed from Closed to Open

I'm reopening this issue.

Originally, when 2e7bceb34ea858649e1f975a934ce1894d1f06a6 fixed cfuncs to no longer use the VM stack for large array splats, it was thought to have fully fixed this issue, since the issue was fixed for methods defined in Ruby (iseqs) back in Ruby 2.2.

After additional research, I determined that the same issue affects almost all types of method calls, not just iseq and cfunc calls. There were two main types of remaining issues: important cases (where a large array splat should work) and pedantic cases (where a large array splat raised SystemStackError instead of ArgumentError).

Important cases:

# bmethod
define_method(:a){|*a|}
a(*1380888.times)

# send
def b(*a); end
send(:b, *1380888.times)

# symproc
:b.to_proc.call(self, *1380888.times)

# method to proc
def d; yield(*1380888.times) end
d(&method(:b))

# method_missing
def self.method_missing(*a); end
not_a_method(*1380888.times)

# Also affects use of C-API with a large number of arguments

Pedantic cases:

# iseq with only required or optional arguments
def a; end
a(*1380888.times)
def b(_); end
b(*1380888.times)
def c(_=nil); end
c(*1380888.times)

# attr reader/writer
c = Class.new do
  attr_accessor :a
  alias b a=
end.new
c.a(*1380888.times)
c.b(*1380888.times)

# Struct aref/aset
c = Struct.new(:a) do
  alias b a=
end.new
c.a(*1380888.times)
c.b(*1380888.times)

I have submitted a pull request to fix the issue: https://github.com/ruby/ruby/pull/7522. It's fairly invasive, since the use of the VM stack for handling arguments occurs in multiple different places in the VM, and all the different method call handlers need to be updated to support it. This has a runtime cost, but I'm not sure exactly how much. I would appreciate if someone could do some benchmarking with it. We'll need to decide if fixing this issue is worth the potential performance hit (or someone could potentially optimize the approach I used).

Updated by Eregon (Benoit Daloze) about 1 year ago

My opinion as a TruffleRuby implementer is that I don't think TruffleRuby will or can support this.
So at the very least it should be clear that this isn't something the Ruby language in general really supports.

User code shouldn't pass a million arguments; if there are, say, more than 20 arguments or so, one should pass an array instead. Splatting on both sides is silly: it just creates extra copies and is slower.
It's a Ruby anti-pattern.

So IMO there is no point in fixing or supporting this; such code doesn't make sense, and changing the user code is the best fix, for both performance and simplicity.
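
A small, hypothetical sketch of the suggested style: pass the array itself rather than splatting it:

def process(items)          # rather than def process(*items)
  items.sum
end

data = Array.new(1_000_000, 1)
process(data)               # one argument is passed; no per-element copying onto the stack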

Updated by Eregon (Benoit Daloze) about 1 year ago

In other words, for efficiency, arguments need to be passed on the stack, e.g. for JITed code.
Sure, it would be possible to have special handling for "too many arguments" and do something else, but that means making Ruby slower in general, which seems clearly not worth it for degenerate cases such as this one.

Updated by Eregon (Benoit Daloze) about 1 year ago

Regarding the original report, there is a clear solution now with to_h:

1000.times.each_slice(2).to_a.to_h
# or
1000.times.each_slice(2).to_h
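
For the original array-of-pairs use case, the same idea applies directly; a small, hypothetical example:

pairs = 200_000.times.map { |i| [i, i.to_s] }
pairs.to_h    # builds the hash without a splat, so no stack usage proportional to the size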

Updated by jeremyevans0 (Jeremy Evans) 12 months ago

Considering we fixed this issue for iseq and cfunc methods, probably the main reason to reject fixing this issue for the remaining method types is that doing so potentially lowers performance. To try to counteract this, I've updated my pull request to include an optimization for bmethods that improves performance 38-59% for simple method calls, and up to 180% for method calls with keywords. Hopefully this increase in bmethod performance more than cancels out any performance decreases caused by the large argument splat handling.

Updated by jeremyevans0 (Jeremy Evans) 12 months ago

I've updated my pull request to include additional optimizations for:

  • cfunc: 10-15% for f(*a) and 35-40% for f(*a, **kw) if kw is empty
  • send: 5-115% depending on type of call
  • symproc: 5-100% depending on type of call
  • method_missing: 10-115% depending on type of call

The cfunc optimization works by copying the array contents to the stack instead of using CALLER_SETUP_ARG.

The send, symproc, and method_missing optimizations are achieved by avoiding unnecessary use of CALLER_SETUP_ARG.

Hopefully these additional optimizations help offset any performance decrease from the additional checks needed to fix this issue.
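
For reference, the call shapes targeted by these optimizations look roughly like the following. This is only a hypothetical micro-benchmark sketch (names and sizes are made up), not the harness behind the numbers above:

require "benchmark"

class CallShapes
  define_method(:bmethod_rest) { |*a| a.length }   # bmethod with a rest argument
  def iseq_rest(*a); a.length; end
end

obj  = CallShapes.new
args = Array.new(10, 1)
n    = 500_000

Benchmark.bm(10) do |x|
  x.report("bmethod") { n.times { obj.bmethod_rest(*args) } }
  x.report("send")    { n.times { obj.send(:iseq_rest, *args) } }
  x.report("symproc") { n.times { :iseq_rest.to_proc.call(obj, *args) } }
end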

Updated by ko1 (Koichi Sasada) 11 months ago

Quote from devmeeting agenda https://bugs.ruby-lang.org/issues/19525:

The fix results in a minor performance decrease in microbenchmarks.

Could you show more details (results)?
Do you have an analysis of which line(s) make it slower?


I don't think this feature should be rejected. It is cool to support this feature (a long splat can be accepted by a rest argument).
However, personally speaking, I feel the proposed patch (https://github.com/ruby/ruby/pull/7522) is too complex for future maintenance compared with the benefits it brings.

Right now I have no time to review the patch closely, so I couldn't confirm whether the patch has such an issue.
Still, I agree to merging it (and rewriting parts later if they can be improved further) because it is well tested.
I think it would be better to have measurements on some benchmarks, though.

General comments:

  • Some code is duplicated, so maybe it can be made shorter.
  • (optimization) It is not clear how common the optimization target cases are (optimizations for minor cases can introduce issues such as i-cache misses, difficulty in future maintenance, and so on).

Updated by jeremyevans0 (Jeremy Evans) 11 months ago

ko1 (Koichi Sasada) wrote in #note-17:

Quote from devmeeting agenda https://bugs.ruby-lang.org/issues/19525:

The fix results in a minor performance decrease in microbenchmarks.

Could you show more details (results)?

In terms of existing benchmarks:

  • For the app_fib benchmark, about a 1-3% decrease.
  • The vm_send and vm_send_var benchmarks improve 3-5% due to the send optimization.

I'll try to do some more benchmarking tomorrow and report back. Is there a decent real world benchmark in benchmarks I can use? With the patch set, some microbenchmarks are slower, but some cases I optimized (bmethod/send/symproc/method_missing) are much faster (over 2x). A real world benchmark would be more useful to determine the actual performance differences.

I do most of my development on OpenBSD, which is a bit suboptimal for benchmarking small differences in performance in my experience (possibly due to the additional randomization).

If someone could run yjit-bench on the pull request branch (in interpreter mode), that would be very helpful.

Do you have an analysis of which line(s) make it slower?

Unfortunately, I don't. My guess would be it is due to the additional branches in CALLER_SETUP_ARG and checking for calling->heap_argv.

I don't think this feature should be rejected. It is cool to support this feature (a long splat can be accepted by a rest argument).
However, personally speaking, I feel the proposed patch (https://github.com/ruby/ruby/pull/7522) is too complex for future maintenance compared with the benefits it brings.

Agreed. I wish the patch could be made simpler, but I think most of the complexity of the patch is necessary if we want to fix the bug.

Right now I have no time to review the patch closely, so I couldn't confirm whether the patch has such an issue.
Still, I agree to merging it (and rewriting parts later if they can be improved further) because it is well tested.

OK. Before it is merged, the yjit team needs to make the necessary changes to yjit to support it. Alternatively, they could temporarily disable the parts of yjit that this breaks, but from talking to @alanwu (Alan Wu), that could result in temporarily disabling a lot of yjit. I think it would be preferable to fix yjit before this is merged.

I think it would be better to have measurements on some benchmarks, though.

I added some benchmarks related to the optimizations, along with basic results of those benchmarks in the related commits. As mentioned above, I'll try to do additional benchmarking and report back.

General comments:

  • Some code is duplicated, so maybe it can be made shorter.

OK. I will review and see if I can eliminate the redundant code.

  • (optimization) It is not clear how common the optimization target cases are (optimizations for minor cases can introduce issues such as i-cache misses, difficulty in future maintenance, and so on).

The bmethod/send/symproc/method_missing optimizations are all very large in certain cases and should definitely be included.

The cfunc optimizations are limited to specific cases (*args or *args, **kw with empty kw) and not as large even in those cases (10-15% for *args, 35-40% for *args, **kw with empty kw). I'm guessing they are still a net performance improvement, though.

Updated by ko1 (Koichi Sasada) 11 months ago

jeremyevans0 (Jeremy Evans) wrote in #note-18:

The bmethod/send/symproc/method_missing optimizations are all very large in certain cases and should definitely be included.

I mean, how many of these "certain cases" are there in apps? The debug counter feature in debug_counter.[ch] will help confirm such statistics.

Updated by jeremyevans0 (Jeremy Evans) 11 months ago

I rebased my branch against master, and then ran all of the app_* benchmarks, here are the results:

app_aobench: +1%
app_erb: 0%
app_factorial: 0%
app_fib: +5%
app_lc_fizzbuzz: 0%
app_mandelbrot: 0%
app_pentomino: -1%
app_raise: 0%
app_strconcat: +8%
app_tak: +4%
app_tarai: +3%
app_uri: -2%

For most of the benchmarks, I ran with --repeat-count 10 --repeat-result best (some take a long time and I only ran with 1 or 3 instead of 10).

So from this benchmarking, only app_pentomino and app_uri are slower, by 1-2%. 5 benchmarks are faster, by up to 8%. 5 benchmarks did not show any performance differences.

app_fib is showing up 5% faster now, when it was previously showing 1-3% slower. To make sure this wasn't an anomaly, I ran with --repeat-count 25, and still got the same results. Again, this could just be due to my environment (OpenBSD), as I cannot think of a reason why app_fib would be faster with the changes.

All of these benchmarks are more microbenchmark in nature. More realistic benchmarks, such as yjit-bench on Linux, would be better for measuring actual differences in performance.

Updated by jeremyevans0 (Jeremy Evans) 11 months ago

I ran yjit-bench with both the master branch and the PR branch. Here are the results:

Total time spent benchmarking: 5172s

master: ruby 3.3.0dev (2023-04-14T03:43:46Z master 3733ee835b) [x86_64-openbsd7.3]
heap_argv: ruby 3.3.0dev (2023-04-15T06:35:36Z large-array-splat-.. a0eb73211c) [x86_64-openbsd7.3]

--------------  -----------  ----------  --------------  ----------  ----------------  -----------------
bench           master (ms)  stddev (%)  heap_argv (ms)  stddev (%)  master/heap_argv  heap_argv 1st itr
activerecord    150.8        2.4         150.9           2.1         1.00              0.97             
erubi_rails     52.6         7.9         52.9            8.1         1.00              1.05             
hexapdf         6996.9       1.0         6925.5          0.6         1.01              1.11             
liquid-c        177.5        2.0         175.3           1.5         1.01              1.03             
liquid-compile  165.6        2.8         165.5           2.0         1.00              1.01             
liquid-render   372.2        0.6         374.9           1.6         0.99              1.01             
mail            389.6        0.7         394.0           2.0         0.99              1.01             
psych-load      6431.2       0.2         6356.4          0.3         1.01              1.01             
railsbench      4654.3       0.3         4696.0          0.6         0.99              0.99             
ruby-lsp        159.6        6.0         155.6           5.8         1.03              1.05             
sequel          215.0        2.6         214.8           0.9         1.00              1.00             
binarytrees     840.2        0.3         840.2           0.9         1.00              0.99             
chunky_png      2710.0       0.2         2739.4          0.4         0.99              0.98             
erubi           732.7        1.6         726.9           1.1         1.01              1.02             
etanni          984.5        1.6         974.1           0.5         1.01              1.01             
fannkuchredux   4282.9       0.2         4334.9          0.2         0.99              0.99             
lee             3625.8       0.4         3594.9          0.3         1.01              1.01             
nbody           183.7        0.9         178.7           0.2         1.03              1.03             
optcarrot       9673.8       1.0         9626.9          0.9         1.00              1.01             
ruby-json       9889.0       0.1         9848.9          0.4         1.00              1.01             
rubykon         23063.9      0.5         22953.8         0.3         1.00              1.00             
30k_ifelse      3829.2       0.5         3824.4          1.0         1.00              0.99             
30k_methods     7761.7       0.2         7665.6          0.2         1.01              1.01             
cfunc_itself    327.7        0.3         326.6           0.5         1.00              1.00             
fib             466.2        0.3         469.2           0.6         0.99              1.00             
getivar         221.7        0.6         222.1           0.3         1.00              1.00             
keyword_args    652.6        0.2         653.2           0.3         1.00              1.00             
respond_to      893.4        0.2         909.5           0.1         0.98              0.98             
setivar         148.3        0.3         143.5           0.2         1.03              1.01             
setivar_object  295.0        0.5         291.6           0.5         1.01              1.01             
setivar_young   295.1        0.4         291.6           0.6         1.01              1.01             
str_concat      231.9        3.2         211.0           2.3         1.10              1.07             
throw           40.2         1.3         41.6            9.2         0.97              0.98             
--------------  -----------  ----------  --------------  ----------  ----------------  -----------------
Legend:
- master/heap_argv: ratio of master/heap_argv time. Higher is better for heap_argv. Above 1 represents a speedup.
- heap_argv 1st itr: ratio of master/heap_argv time for the first benchmarking iteration.

So it looks like it is slower on 8 benchmarks (6 1% slower, 1 2% slower, 1 3% slower), and faster on 13 benchmarks (9 1% faster, 3 3% faster, 1 10% faster). So on the whole, it looks like a net performance increase.

It would be good to get benchmark results from Linux, so if someone could contribute that, I would appreciate it.

Updated by k0kubun (Takashi Kokubun) 11 months ago

It would be good to get benchmark results from Linux, so if someone could contribute that, I would appreciate it.

I have an x86_64-linux environment with CPU frequency scaling disabled, so I benchmarked your PR with yjit-bench. I used --category headline because other benchmarks are less important in practice.

before: ruby 3.3.0dev (2023-04-14T03:43:46Z master 3733ee835b) [x86_64-linux]
after: ruby 3.3.0dev (2023-04-15T06:35:36Z large-array-splat-.. a0eb73211c) [x86_64-linux]

--------------  -----------  ----------  ----------  ----------  ------------  -------------
bench           before (ms)  stddev (%)  after (ms)  stddev (%)  before/after  after 1st itr
activerecord    65.6         0.4         65.8        0.3         1.00          0.99
erubi_rails     18.7         1.3         18.7        12.4        1.00          1.02
hexapdf         2164.2       0.4         2181.5      0.8         0.99          0.98
liquid-c        57.0         1.6         57.0        1.9         1.00          0.99
liquid-compile  52.6         0.6         52.4        0.6         1.00          1.00
liquid-render   139.8        1.3         140.2       1.1         1.00          1.00
mail            118.1        0.1         118.7       0.2         1.00          1.00
psych-load      1707.5       0.2         1750.9      0.1         0.98          0.97
railsbench      1907.4       0.8         1929.4      0.8         0.99          0.99
ruby-lsp        59.5         10.3        59.4        11.9        1.00          0.98
sequel          65.6         0.2         65.6        0.2         1.00          1.00
--------------  -----------  ----------  ----------  ----------  ------------  -------------

I tried running them a few times. In before/after, psych-load is consistently 2% slower. hexapdf and railsbench show a 1% slowdown, which may be insignificant. The other benchmarks show no difference.

Updated by Eregon (Benoit Daloze) 11 months ago

@jeremyevans0 (Jeremy Evans)

I rebased my branch against master, and then ran all of the app_* benchmarks, here are the results:

Are the +N% there improvements or regressions? From those numbers it sounds like + would be regressions (i.e., more time to execute the same thing).


I am thinking a bit more about the implications of this for Ruby implementations and JITs.
Passing arguments only on the stack means a huge number of arguments cannot be passed (the case on TruffleRuby).
Passing them only as a heap array seems inefficient in general (it would cause extra allocations, at least in the interpreter, even for foo(1, 2)).
I guess one could use two different calling conventions: on the stack if there is no rest parameter, on the heap if there is. But more calling conventions are a clear cost, as they cause extra checks for every call, even more so for polymorphic call sites (plus it's messy to do callee-specific logic in the caller).

If passing arguments both on the stack and in a heap array is supported, then the called method (the callee) will most likely need to branch to find out where to read arguments from.
Having the callee deal with two calling conventions always seems like an anti-pattern.
That may actually be easier to deal with in C, because a VALUE* pointer can represent both; then it would be one check on method entry for which pointer and size to use.
In Java, if arguments are passed as an Object[] with hidden arguments at the start of the array, there is no way to share the logic with a Ruby Array from the heap, or it would need an offset for every argument access, which seems very expensive.
I suppose one could technically compile two variants of a method, one for on-stack and one for heap-array arguments, but that seems very expensive from a warmup and memory perspective, and it again costs more calling conventions.
Also, when using array storage strategies, the array might be an int[] behind the scenes, and then passing it as a single argument rather than a splat is much faster.

Basically, I think efficient Ruby implementations and JITs might not want to deal with the complexity of on-heap arguments.
Such a usage pattern is intrinsically inefficient.
For example, m(:name, *array) is quite expensive if array is big; m(:name, array) is strictly better from a performance POV.
m(*array) can at best be as fast as m(array), but can be much worse, e.g. if passed on the stack (fewer than 128 elements for your PR) or if array is an int[].

Of course CRuby devs will decide what they want here.
The real issue is, if CRuby accepts this:

  • There is probably no hope to ever revert that decision and to remove those costs, because some code will likely start to depend on it.
  • It might encourage Ruby users to abuse splats more since they seem not much slower than non-splat on CRuby and they don't trigger SystemStackError.

Updated by jeremyevans0 (Jeremy Evans) 11 months ago

Eregon (Benoit Daloze) wrote in #note-23:

@jeremyevans0 (Jeremy Evans)

I rebased my branch against master, and then ran all of the app_* benchmarks, here are the results:

Are the +N% there improvements or regressions? From those numbers it sounds like + would be regressions (i.e., more time to execute the same thing).

+N% is an improvement in iterations per second, -N% is a decrease in iterations per second.

I am thinking a bit more about the implications of this for Ruby implementations and JITs.
Only passing on the stack means not allowed to pass a huge number of arguments (the case on TruffleRuby).
Only passing as a heap array seems inefficient in general (would cause extra allocations, at least in interpreter, for foo(1, 2)).
I guess one could use 2 different calling conventions, on stack if no rest parameter, on heap if there is a rest parameter. But more calling conventions is a clear cost as it causes extra checks for every call, even more so for polymorphic call site (+ it's messy to do callee-specific logic in the caller).

For CRuby, a heap array is used for large array splats (configurable, but currently 129+ elements). Smaller array splats use the VM stack. A heap array is only used for method calls with an argument splat, never for other method calls (even if you pass 129+ arguments). On CRuby, there is a minor cost for checking whether a heap-allocated array was used.
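
A hypothetical illustration of that behavior (the 129-element threshold is the value mentioned above and is configurable):

def rest(*a); a.length; end

rest(*Array.new(128, 0))   # small splat: arguments expanded on the VM stack
rest(*Array.new(129, 0))   # large splat: routed through a hidden heap array instead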

  • There is probably no hope to ever revert that decision and to remove those costs, because some code will likely start to depend on it.

@ko1 (Koichi Sasada) has already told me that the heap_argv part of the patch will be reverted if it becomes a significant obstacle to future CRuby optimization work. We'll keep the other optimizations in the pull request in that case.

  • It might encourage Ruby users to abuse splats more since they seem not much slower than non-splat on CRuby and they don't trigger SystemStackError.

This is incorrect. Passing arrays via splats is always slower than passing arrays as positional arguments. The pull request makes passing large arrays via splats not trigger SystemStackError. For arrays with 129+ elements, the pull request actually slows down such calls by using a temporary array instead of passing the elements on the VM stack. So if anything, the patch encourages users not to pass large arrays as splats, as doing so is even worse for performance than before.

We may want to consider adding a performance warning for passing large arrays via splats.

Updated by Eregon (Benoit Daloze) 11 months ago

jeremyevans0 (Jeremy Evans) wrote in #note-24:

So if anything, the patch encourages users not to pass large arrays as splats, as doing so is even worse for performance than before.

Thank you for the clarification. Then at least it's not going to encourage large splats due to changed performance; good.
Of course it remains a concern that users/code may start to depend on this, i.e., on there being no SystemStackError in calls with a large splat, and then changing that back would be a breaking change.
But maybe it's rare, and stays rare enough, that it might be possible to change it again (e.g. if valuable enough for performance); I'm unsure.

We may want to consider adding a performance warning for passing large arrays via splats.

That's a good idea.

#26

Updated by jeremyevans (Jeremy Evans) 11 months ago

  • Status changed from Open to Closed

Applied in changeset git|99c6d19e502b5fdadbd367ae4b6bb3fab850fddc.


Generalize cfunc large array splat fix to fix many additional cases raising SystemStackError

Originally, when 2e7bceb34ea858649e1f975a934ce1894d1f06a6 fixed cfuncs to no
longer use the VM stack for large array splats, it was thought to have fully
fixed Bug #4040, since the issue was fixed for methods defined in Ruby (iseqs)
back in Ruby 2.2.

After additional research, I determined that the same issue affects
almost all types of method calls, not just iseq and cfunc calls. There
were two main types of remaining issues: important cases (where a large
array splat should work) and pedantic cases (where a large array splat
raised SystemStackError instead of ArgumentError).

Important cases:

define_method(:a){|*a|}
a(*1380888.times)

def b(*a); end
send(:b, *1380888.times)

:b.to_proc.call(self, *1380888.times)

def d; yield(*1380888.times) end
d(&method(:b))

def self.method_missing(*a); end
not_a_method(*1380888.times)

Pedantic cases:

def a; end
a(*1380888.times)
def b(_); end
b(*1380888.times)
def c(_=nil); end
c(*1380888.times)

c = Class.new do
  attr_accessor :a
  alias b a=
end.new
c.a(*1380888.times)
c.b(*1380888.times)

c = Struct.new(:a) do
  alias b a=
end.new
c.a(*1380888.times)
c.b(*1380888.times)

This patch fixes all usage of CALLER_SETUP_ARG with splatting a large
number of arguments, and required similar fixes to use a temporary
hidden array in three other cases where the VM would use the VM stack
for handling a large number of arguments. However, it is possible
there may be additional cases where splatting a large number
of arguments still causes a SystemStackError.

This has a measurable performance impact, as it requires additional
checks for a large number of arguments in many additional cases.

This change is fairly invasive, as there were many different VM
functions that needed to be modified to support this. To avoid
too much API change, I modified struct rb_calling_info to add a
heap_argv member for storing the array, so I would not have to
thread it through many functions. This struct is always stack
allocated, which helps ensure the GC doesn't collect it early.

Because of how invasive the changes are, and how rarely large
arrays are actually splatted in Ruby code, the existing test/spec
suites are not great at testing for correct behavior. To try to
find and fix all issues, I tested this in CI with
VM_ARGC_STACK_MAX set to -1, ensuring that a temporary array is used
for all array splat method calls. This was very helpful in
finding breaking cases, especially ones involving flagged keyword
hashes.

Fixes [Bug #4040]

Co-authored-by: Jimmy Miller

Updated by tenderlovemaking (Aaron Patterson) 6 months ago

Hi,

We (on the YJIT team) have been tracking Ruby performance. We usually look at YJIT performance as compared to the interpreter, but recently we started looking at interpreter performance (it's getting later in the year so we want to make sure things look good for the release). We noticed there was a degradation in interpreter performance in April, and I tracked the issue to this commit.

We were specifically looking at the chunky png benchmark. If you check this graph then click "Time" on the bottom, you can see the increase in time in April.

Running perf stat on the benchmark shows this commit increases the executed instructions from 232,171,088,239 to 244,008,191,171.

Here are the perf stat results I got for e7cdce83e8:

 Performance counter stats for 'ruby -v benchmark.rb':

         18,995.65 msec task-clock                #    1.000 CPUs utilized          
                68      context-switches          #    3.580 /sec                   
                 1      cpu-migrations            #    0.053 /sec                   
            33,003      page-faults               #    1.737 K/sec                  
    82,022,881,262      cycles                    #    4.318 GHz                      (83.32%)
       449,576,108      stalled-cycles-frontend   #    0.55% frontend cycles idle     (83.32%)
    21,629,264,292      stalled-cycles-backend    #   26.37% backend cycles idle      (83.33%)
   232,171,088,239      instructions              #    2.83  insn per cycle         
                                                  #    0.09  stalled cycles per insn  (83.35%)
    40,092,228,662      branches                  #    2.111 G/sec                    (83.37%)
        76,492,242      branch-misses             #    0.19% of all branches          (83.34%)

      18.997823376 seconds time elapsed

      18.904486000 seconds user
       0.091909000 seconds sys

vs perf stat for 99c6d19e50

 Performance counter stats for 'ruby -v benchmark.rb':

         19,415.14 msec task-clock                #    1.000 CPUs utilized          
                66      context-switches          #    3.399 /sec                   
                 2      cpu-migrations            #    0.103 /sec                   
            38,513      page-faults               #    1.984 K/sec                  
    82,876,137,933      cycles                    #    4.269 GHz                      (83.32%)
       577,427,117      stalled-cycles-frontend   #    0.70% frontend cycles idle     (83.33%)
    20,414,833,187      stalled-cycles-backend    #   24.63% backend cycles idle      (83.35%)
   244,008,191,171      instructions              #    2.94  insn per cycle         
                                                  #    0.08  stalled cycles per insn  (83.37%)
    41,385,015,387      branches                  #    2.132 G/sec                    (83.35%)
        76,702,379      branch-misses             #    0.19% of all branches          (83.32%)

      19.417447024 seconds time elapsed

      19.346844000 seconds user
       0.068922000 seconds sys

It's cool that we could fix a 13-year-old bug, but given the "rarity" of this issue, I'm not sure it's worth the slowdown? 😅

I will try to find some way to speed this up, but I'm not sure if I'll have time. In case someone has time to look, the way to reproduce this is by checking out yjit-bench then running the chunky-png benchmark like this:

perf stat ruby -v benchmark.rb
