improve GC performance by 5% with builtin_prefetch

Added by bpowers (Bobby Powers) almost 3 years ago. Updated almost 3 years ago.

The mark phase of non-incremental major GC is (I believe) dominated by pointer chasing. One way to improve that is to prefetch cachelines from memory before they are accessed, which reduces stalls. I did some experimenting, and the following patch reduces the time spent on a full GC from ~950 milliseconds to ~900 milliseconds, a small but stable improvement. I would love it if others could run additional benchmarks (or point me at them) to see whether this holds up beyond the web service I tested, and whether something like this could be considered for inclusion.
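To make the idea concrete, here is a minimal sketch of prefetching during marking. It is not the patch itself: `obj_t`, `mark_stack_t`, `push_mark` and `mark_from` are hypothetical names invented for illustration, and the object layout is far simpler than Ruby's. The only real API used is the GCC/Clang builtin `__builtin_prefetch`, which is a hint and never faults.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical object layout for illustration; not Ruby's RVALUE. */
typedef struct obj {
    int marked;
    size_t nchildren;
    struct obj **children;
} obj_t;

#define MARK_STACK_MAX 1024

typedef struct {
    obj_t *items[MARK_STACK_MAX];
    size_t len;
} mark_stack_t;

/* Push an object for later scanning and ask the CPU to start
 * loading its cacheline now, so the load overlaps with other work. */
void push_mark(mark_stack_t *s, obj_t *o)
{
    if (o == NULL) return;
    /* Arguments: address, 0 = prefetch for read, 3 = keep in all cache levels. */
    __builtin_prefetch(o, 0, 3);
    assert(s->len < MARK_STACK_MAX);
    s->items[s->len++] = o;
}

/* Mark everything reachable from root; returns how many objects
 * were newly marked by this call. */
size_t mark_from(obj_t *root)
{
    mark_stack_t stack;
    size_t count = 0;

    stack.len = 0;
    push_mark(&stack, root);
    while (stack.len > 0) {
        obj_t *o = stack.items[--stack.len];
        if (o->marked) continue;   /* already visited (shared or cyclic) */
        o->marked = 1;
        count++;
        for (size_t i = 0; i < o->nchildren; i++)
            push_mark(&stack, o->children[i]);
    }
    return count;
}
```

Whether the hint helps depends on how much work happens between the push and the later pop. With a LIFO stack the most recently pushed object is popped almost immediately, so the prefetch may not have completed by then.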

I also attempted a more "principled" approach based on an optimization described in the GC handbook: putting a FIFO queue in front of the mark stack, and prefetching addresses as they enter the queue. However, I wasn't able to see any performance improvement there despite testing a number of queue sizes from 4 to 64. It's possible I implemented this wrong, or misjudged the access patterns (if e.g. the memory of a VALUE is accessed before push_mark_stack is called, it would invalidate this approach). The code for that alternative is here:
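Independently of the linked code, and not as a reproduction of it, the general shape of a FIFO queue in front of a mark stack can be sketched as follows. Everything here is an assumption made for illustration: `node_t`, `marker_t`, `marker_push`, `marker_next`, `mark_with_fifo` and the `FIFO_SIZE` of 8 are hypothetical, and only `__builtin_prefetch` is a real GCC/Clang builtin. Objects are prefetched when they move from the stack into the queue, and are scanned only when they reach the front of the queue, which gives the prefetch up to `FIFO_SIZE` scan steps to complete.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical object layout for illustration; not Ruby's RVALUE. */
typedef struct node {
    int marked;
    size_t nchildren;
    struct node **children;
} node_t;

#define STACK_MAX 1024
#define FIFO_SIZE 8   /* illustrative; the post above tried sizes from 4 to 64 */

typedef struct {
    node_t *stack[STACK_MAX];
    size_t stack_len;
    node_t *fifo[FIFO_SIZE];   /* ring buffer */
    size_t head;               /* index of the oldest queued entry */
    size_t count;              /* number of queued entries */
} marker_t;

/* Newly discovered objects go onto the ordinary mark stack. */
void marker_push(marker_t *m, node_t *n)
{
    if (n == NULL) return;
    assert(m->stack_len < STACK_MAX);
    m->stack[m->stack_len++] = n;
}

/* Top up the queue from the stack, prefetching each entry as it
 * enters, then hand back the oldest entry (or NULL when done). */
node_t *marker_next(marker_t *m)
{
    while (m->count < FIFO_SIZE && m->stack_len > 0) {
        node_t *n = m->stack[--m->stack_len];
        __builtin_prefetch(n, 0, 3);
        m->fifo[(m->head + m->count) % FIFO_SIZE] = n;
        m->count++;
    }
    if (m->count == 0) return NULL;

    node_t *oldest = m->fifo[m->head];
    m->head = (m->head + 1) % FIFO_SIZE;
    m->count--;
    return oldest;
}

/* Mark everything reachable from root; returns how many objects
 * were newly marked by this call. */
size_t mark_with_fifo(node_t *root)
{
    marker_t m;
    size_t marked = 0;
    node_t *n;

    m.stack_len = 0;
    m.head = 0;
    m.count = 0;

    marker_push(&m, root);
    while ((n = marker_next(&m)) != NULL) {
        if (n->marked) continue;
        n->marked = 1;
        marked++;
        for (size_t i = 0; i < n->nchildren; i++)
            marker_push(&m, n->children[i]);
    }
    return marked;
}
```

In this sketch nothing touches an object's memory before it leaves the queue. If the real marking code reads the object before pushing it, as the post suspects may happen, the cache miss has already been paid and the queue adds overhead without hiding any latency.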


Updated by alanwu (Alan Wu) almost 3 years ago

I ran the patch on some included GC benchmarks in the repo and it doesn't seem to be a pure win (built-ruby is the patched version):

$ make benchmark ITEM=gc_ COMPARE_RUBY=/opt/rubies/2.8.0-clean/bin/ruby OPTS=-v
/opt/rubies/2.6.5/bin/ruby --disable=gems -rrubygems -I./benchmark/lib ./benchmark/benchmark-driver/exe/benchmark-driver \
	            --executables="compare-ruby::/opt/rubies/2.8.0-clean/bin/ruby -I.ext/common --disable-gem" \
	            --executables="built-ruby::./miniruby -I./lib -I. -I.ext/common  ./tool/runruby.rb --extout=.ext  -- --disable-gems --disable-gem" \
	            $(find ./benchmark -maxdepth 1 -name 'gc_' -o -name '*gc_*.yml' -o -name '*gc_*.rb' | sort) -v
compare-ruby: ruby 2.8.0dev (2020-02-24T06:37:52Z master 8b6e2685a4) [x86_64-darwin19]
built-ruby: ruby 2.8.0dev (2020-02-24T22:54:22Z master 0e08060632) [x86_64-darwin19]
last_commit=gc: prefech objects; seems to improve full GC performance by 5%
                  compare-ruby:   5572210.3 i/s
                    built-ruby:   5411724.0 i/s - 1.03x  slower

                  compare-ruby:   6563814.6 i/s
                    built-ruby:   6410782.4 i/s - 1.02x  slower

                  compare-ruby:   6331068.0 i/s
                    built-ruby:   5942302.6 i/s - 1.07x  slower

                  compare-ruby:   6668692.5 i/s
                    built-ruby:   6599273.4 i/s - 1.01x  slower

                    built-ruby:  83715634.7 i/s
                  compare-ruby:  79921144.5 i/s - 1.05x  slower

                    built-ruby:  65907268.5 i/s
                  compare-ruby:  60669426.5 i/s - 1.09x  slower

                    built-ruby:  90917907.2 i/s
                  compare-ruby:  86138579.7 i/s - 1.06x  slower

                    built-ruby:  77278160.2 i/s
                  compare-ruby:  67541402.9 i/s - 1.14x  slower

                  compare-ruby:         0.4 i/s
                    built-ruby:         0.3 i/s - 1.06x  slower

                    built-ruby:         0.5 i/s
                  compare-ruby:         0.5 i/s - 1.01x  slower

                  compare-ruby:         0.4 i/s
                    built-ruby:         0.4 i/s - 1.06x  slower

These are microbenchmarks, though, and I don't know how representative they are of real workloads.

Updated by bpowers (Bobby Powers) almost 3 years ago

alanwu (Alan Wu) wrote in #note-1:

I ran the patch on some included GC benchmarks in the repo and it doesn't seem to be a pure win (built-ruby is the patched version):

Thanks! I hadn't seen these. I see roughly similar results locally on these benchmarks; I'll dig in to see if I can understand what's happening.


