Bug #20710
closedReducing Hash allocation introduces large performance degradation (probably related to VWA)
Description
I found a surprising performance degradation while developing RBS.
In short, I tried to remove unnecessary Hash allocations for RBS. Then, it made the execution time 2x slower.
VWA for Hash probably causes this degradation. I'd be happy if we could mitigate the impact by updating the memory management strategy.
Reproduce¶
You can reproduce this problem on a PR in pocke/rbs repository.
https://github.com/pocke/rbs/pull/2
This PR dedups empty Hash objects.
-
git clone
and checkout bundle install
-
bundle exec rake compile
for C-ext bundle ruby benchmark/benchmark_new_env.rb
The "before" commit is https://github.com/pocke/rbs/commit/2c356c060286429cfdb034f88a74a6f94420fd21.
The "after" commit is https://github.com/pocke/rbs/commit/bfb2c367c7d3b7f93720392252d3a3980d7bf335.
The benchmark results are the following:
# Before
$ bundle exec ruby benchmark/benchmark_new_env.rb
(snip)
new_env 6.426 (±15.6%) i/s - 64.000 in 10.125442s
new_rails_env 0.968 (± 0.0%) i/s - 10.000 in 10.355738s
# After
$ bundle exec ruby benchmark/benchmark_new_env.rb
(snip)
new_env 4.371 (±22.9%) i/s - 43.000 in 10.150192s
new_rails_env 0.360 (± 0.0%) i/s - 4.000 in 11.313158s
The IPS decreased 1.47x for new_env
case (parsing small RBS env), and 2.69x for new_rails_env
(parsing large RBS env).
Investigation¶
GC.stat¶
GC.stat
indicates the number of minor GCs increases.
# In the RBS repository
require_relative './benchmark/utils'
tmpdir = prepare_collection!
new_rails_env(tmpdir)
pp GC.stat
# before
{:count=>126,
:time=>541,
:marking_time=>496,
:sweeping_time=>45,
:heap_allocated_pages=>702,
:heap_sorted_length=>984,
:heap_allocatable_pages=>282,
:heap_available_slots=>793270,
:heap_live_slots=>787407,
:heap_free_slots=>5863,
:heap_final_slots=>0,
:heap_marked_slots=>757744,
:heap_eden_pages=>702,
:heap_tomb_pages=>0,
:total_allocated_pages=>702,
:total_freed_pages=>0,
:total_allocated_objects=>2220605,
:total_freed_objects=>1433198,
:malloc_increase_bytes=>5872,
:malloc_increase_bytes_limit=>16777216,
:minor_gc_count=>112,
:major_gc_count=>14,
:compact_count=>0,
:read_barrier_faults=>0,
:total_moved_objects=>0,
:remembered_wb_unprotected_objects=>0,
:remembered_wb_unprotected_objects_limit=>4779,
:old_objects=>615704,
:old_objects_limit=>955872,
:oldmalloc_increase_bytes=>210912,
:oldmalloc_increase_bytes_limit=>16777216}
# after
{:count=>255,
:time=>1551,
:marking_time=>1496,
:sweeping_time=>55,
:heap_allocated_pages=>570,
:heap_sorted_length=>1038,
:heap_allocatable_pages=>468,
:heap_available_slots=>735520,
:heap_live_slots=>731712,
:heap_free_slots=>3808,
:heap_final_slots=>0,
:heap_marked_slots=>728727,
:heap_eden_pages=>570,
:heap_tomb_pages=>0,
:total_allocated_pages=>570,
:total_freed_pages=>0,
:total_allocated_objects=>2183278,
:total_freed_objects=>1451566,
:malloc_increase_bytes=>1200,
:malloc_increase_bytes_limit=>16777216,
:minor_gc_count=>242,
:major_gc_count=>13,
:compact_count=>0,
:read_barrier_faults=>0,
:total_moved_objects=>0,
:remembered_wb_unprotected_objects=>0,
:remembered_wb_unprotected_objects_limit=>5915,
:old_objects=>600594,
:old_objects_limit=>1183070,
:oldmalloc_increase_bytes=>8128,
:oldmalloc_increase_bytes_limit=>16777216}
Warming up Hashes¶
The following patch, which creates unnecessary Hash objects before the benchmark, improves the execution time.
diff --git a/benchmark/benchmark_new_env.rb b/benchmark/benchmark_new_env.rb
index 6dd2b73f..a8da61c6 100644
--- a/benchmark/benchmark_new_env.rb
+++ b/benchmark/benchmark_new_env.rb
@@ -4,6 +4,8 @@ require 'benchmark/ips'
tmpdir = prepare_collection!
+(0..30_000_000).map { {} }
+
Benchmark.ips do |x|
x.time = 10
The results are the following:
# Before
Calculating -------------------------------------
new_env 10.354 (± 9.7%) i/s - 103.000 in 10.013834s
new_rails_env 1.661 (± 0.0%) i/s - 17.000 in 10.282490s
# After
Calculating -------------------------------------
new_env 10.771 (± 9.3%) i/s - 107.000 in 10.010446s
new_rails_env 1.584 (± 0.0%) i/s - 16.000 in 10.178984s
RUBY_GC_HEAP_FREE_SLOTS_MIN_RATIO
¶
The RUBY_GC_HEAP_FREE_SLOTS_MIN_RATIO
env var also mitigates the performance impact.
In this example, I set RUBY_GC_HEAP_FREE_SLOTS_MIN_RATIO=0.6
(default: 0.20)
# Before
Calculating -------------------------------------
new_env 10.271 (± 9.7%) i/s - 102.000 in 10.087191s
new_rails_env 1.529 (± 0.0%) i/s - 16.000 in 10.538043s
# After
$ env RUBY_GC_HEAP_FREE_SLOTS_MIN_RATIO=0.6 bundle exec ruby benchmark/benchmark_new_env.rb
Calculating -------------------------------------
new_env 11.003 (± 9.1%) i/s - 110.000 in 10.068428s
new_rails_env 1.347 (± 0.0%) i/s - 14.000 in 11.117665s
Additional Information¶
- I applied the same change to Array. But it does not cause this problem.
- I guess the cause is the difference of the Size Pool. An empty Array uses 40 bytes like the ordinal Ruby object, but an empty Hash uses 160 bytes.
- The Size Pool for 160 bytes objects has fewer objects than the 40 bytes one. So, reducing allocation affects the performance sensitively.
- I tried it on Ruby 3.2. This change on Ruby 3.2 does not degrade the execution time.
- VWA for Hash is introduced since Ruby 3.3. https://github.com/ruby/ruby/blob/73c39a5f93d3ad4514a06158e2bb7622496372b9/doc/NEWS/NEWS-3.3.0.md#gc--memory-management
Acknowledgement¶
@mame (Yusuke Endoh), @ko1 (Koichi Sasada), and @soutaro (Soutaro Matsumoto) helped the investigation. I would like to thank them.
Updated by peterzhu2118 (Peter Zhu) about 1 month ago
Thank you for the detailed report.
I had a discussion about this with @mame (Yusuke Endoh) and @byroot (Jean Boussier) last year in the Ruby core slack: https://ruby.slack.com/archives/C02A3SL0S/p1702910003614609
One of the big differences is that VWA eagerly allocates the AR/ST table for the hash, rather than lazily allocate it. This means that for empty hashes, there is a performance penalty for empty hashes, but the penalty can be reclaimed when we have elements inside of the hash.
Another issue with microbenchmarks is that some of the size pools may be small since there aren't many objects, so it may cause GC to trigger more frequently. We don't see this kind of issues with macrobenchmarks since they keep more objects alive after bootup. I can see that you've observed this issue in your benchmark as well.
In the following microbenchmark, we can see that with hashes <= 8 elements (i.e. AR hashes), the performance in Ruby 3.3 is basically on-par with Ruby 3.2, but when we switch to ST tables (9 elements), we can see that VWA is significantly faster:
ruby 3.2.4 (2024-04-23 revision af471c0e01) [arm64-darwin23]
user system total real
Hash with 0 elements 2.421699 0.013686 2.435385 ( 2.436581)
Hash with 1 elements 2.955857 0.029542 2.985399 ( 3.014737)
Hash with 2 elements 2.891668 0.019301 2.910969 ( 2.921928)
Hash with 3 elements 2.900170 0.015396 2.915566 ( 2.916644)
Hash with 4 elements 2.889895 0.014969 2.904864 ( 2.905188)
Hash with 5 elements 2.895059 0.017253 2.912312 ( 2.912845)
Hash with 6 elements 2.869016 0.014351 2.883367 ( 2.883618)
Hash with 7 elements 2.907134 0.016862 2.923996 ( 2.924871)
Hash with 8 elements 2.926749 0.020445 2.947194 ( 2.956753)
Hash with 9 elements 19.932546 0.551577 20.484123 ( 20.498173)
ruby 3.3.3 (2024-06-12 revision f1c7b6f435) [arm64-darwin23]
user system total real
Hash with 0 elements 2.591444 0.023060 2.614504 ( 2.616658)
Hash with 1 elements 3.052488 0.030433 3.082921 ( 3.102709)
Hash with 2 elements 3.064385 0.027627 3.092012 ( 3.106096)
Hash with 3 elements 3.036935 0.023353 3.060288 ( 3.063819)
Hash with 4 elements 3.020218 0.022274 3.042492 ( 3.043182)
Hash with 5 elements 3.053680 0.025551 3.079231 ( 3.083070)
Hash with 6 elements 2.991555 0.023347 3.014902 ( 3.017601)
Hash with 7 elements 3.011856 0.026142 3.037998 ( 3.041611)
Hash with 8 elements 3.044671 0.033276 3.077947 ( 3.109949)
Hash with 9 elements 14.873814 0.400856 15.274670 ( 15.309215)
require "benchmark"
TIMES = 100_000_000
Benchmark.bm do |x|
x.report("Hash with 0 elements") do
TIMES.times { {} }
end
x.report("Hash with 1 elements") do
TIMES.times { { a: 0 } }
end
x.report("Hash with 2 elements") do
TIMES.times { { a: 0, b: 0 } }
end
x.report("Hash with 3 elements") do
TIMES.times { { a: 0, b: 0, c: 0 } }
end
x.report("Hash with 4 elements") do
TIMES.times { { a: 0, b: 0, c: 0, d: 0 } }
end
x.report("Hash with 5 elements") do
TIMES.times { { a: 0, b: 0, c: 0, d: 0, e: 0 } }
end
x.report("Hash with 6 elements") do
TIMES.times { { a: 0, b: 0, c: 0, d: 0, e: 0, f: 0 } }
end
x.report("Hash with 7 elements") do
TIMES.times { { a: 0, b: 0, c: 0, d: 0, e: 0, f: 0, g: 0 } }
end
x.report("Hash with 8 elements") do
TIMES.times { { a: 0, b: 0, c: 0, d: 0, e: 0, f: 0, g: 0, h: 0 } }
end
x.report("Hash with 9 elements") do
TIMES.times { { a: 0, b: 0, c: 0, d: 0, e: 0, f: 0, g: 0, h: 0, i: 0 } }
end
end
Updated by byroot (Jean Boussier) about 1 month ago
I still think free pages should be in a global pool rather than tied to a specific pool size. I believe that would solve this issue.
And yes we don't see it on macro benchmarks, but it might still cause more frequent GC than necessary.
Updated by mame (Yusuke Endoh) about 1 month ago
@peterzhu2118 (Peter Zhu) This is totally different from the issue we talked about in Slack, which was entirely a micro-benchmark of the speed of generating empty hashes. This problem is much more complex.
This problem is that “reducing the number of object creation can slow down execution time”. If my investigation is correct, the mechanism of the problem is as follows.
- If a program creates more objects, GC often occurs and the heap grows. This decreases the frequency of GC occurrence gradually and and throughput increases.
- If a program creates fewer objects, only minor GCs occur and the heap does not grow. Therefore, minor GCs continue to occur at a high frequency and throughput does not increase.
Therefore, allocating unnecessary objects can be much faster, ironically.
If my understanding is correct, minor GC grows the heap only the first few times, and no matter how many minor GCs occur after that, the heap will not grow. I feel there is room for improvement in this.
I am even unsure if VWA is really involved in this problem. But we have observed this problem with Hash objects and Ruby 3.3. It does not reproduce on Ruby 3.2. We couldn't reproduce this by reducing an object rather than Hash.
Note that this problem is not in a micro-benchmark. It actually occurs in RBS + Steep, a real-world macro benchmark.
Updated by peterzhu2118 (Peter Zhu) about 1 month ago
I implemented @byroot's suggestion in this PR: https://github.com/ruby/ruby/pull/11562
It significantly improves the performance in your benchmark and brings it almost as fast as Ruby 3.2.
Updated by pocke (Masataka Kuwabara) about 1 month ago
@peterzhu2118 (Peter Zhu) Thanks for your work! I've confirmed this PR improves the performance in my environment too.
Updated by peterzhu2118 (Peter Zhu) about 1 month ago
- Status changed from Open to Closed
Applied in changeset git|079ef92b5e59b616d670efe81a33e500f2bf6451.
Implement global allocatable slots and empty pages
[Bug #20710]
This commit introduces moves allocatable slots and empty pages from per
size pool to global. This allows size pools to grow globally and allows
empty pages to move between size pools.
For the benchmark in [Bug #20710], this signficantly improves performance:
Before:
new_env 2.563 (± 0.0%) i/s - 26.000 in 10.226703s
new_rails_env 0.293 (± 0.0%) i/s - 3.000 in 10.318960s
After:
new_env 3.781 (±26.4%) i/s - 37.000 in 10.302374s
new_rails_env 0.911 (± 0.0%) i/s - 9.000 in 10.049337s
In the headline benchmarks on yjit-bench, we see the performance is
basically on-par with before, with ruby-lsp being signficantly faster
and activerecord and erubi-rails being slightly slower:
-------------- ----------- ---------- ----------- ---------- -------------- -------------
bench master (ms) stddev (%) branch (ms) stddev (%) branch 1st itr master/branch
activerecord 452.2 0.3 479.4 0.4 0.96 0.94
chunky-png 1157.0 0.4 1172.8 0.1 0.99 0.99
erubi-rails 905.4 0.3 967.2 0.4 0.94 0.94
hexapdf 3566.6 0.6 3553.2 0.3 1.03 1.00
liquid-c 88.9 0.9 89.0 1.3 0.98 1.00
liquid-compile 93.4 0.9 89.9 3.5 1.01 1.04
liquid-render 224.1 0.7 227.1 0.5 1.00 0.99
lobsters 1052.0 3.5 1067.4 2.1 0.99 0.99
mail 197.1 0.4 196.5 0.5 0.98 1.00
psych-load 2960.3 0.1 2988.4 0.8 1.00 0.99
railsbench 2252.6 0.4 2255.9 0.5 0.99 1.00
rubocop 262.7 1.4 270.1 1.8 1.02 0.97
ruby-lsp 275.4 0.5 242.0 0.3 0.97 1.14
sequel 98.4 0.7 98.3 0.6 1.01 1.00
-------------- ----------- ---------- ----------- ---------- -------------- -------------