Bug #21685 (open)
Unnecessary context-switching, especially bad on multi-core machines.
Description
While debugging a performance issue in a large Rails application, I wrote a minimal microbenchmark [here] that reproduces the issue. I was surprised to see that the benchmark takes ~3.6sec on a single-core machine and ~36sec (10x slower) on a machine with 2 or more cores. Initially I thought this was a bug in the implementation of Thread::Queue, but soon realized it relates to how Ruby reschedules threads around system calls.
I prepared a fix in [this branch] which is based off ruby 3.4.7. I can apply the fix to a different branch or to master if that's helpful. The fix simply defers suspending the thread until the syscall has been running for some short interval. I chose 100usec initially, but this could easily be made configurable.
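To make the deferral idea concrete, here is a toy Ruby model. It is not the patch itself, which changes the VM's thread scheduling code. It assumes a made-up list of syscall durations and counts how many of them would lead to the calling thread being suspended, first with no grace interval and then with the 100usec interval mentioned above. The names `DEFER_USEC` and `handoffs` and the workload are hypothetical.

```ruby
# Toy model only: the real change lives in Ruby's thread scheduler, not in Ruby code.
DEFER_USEC = 100 # grace interval from the description; the constant name is hypothetical

# Count the syscalls that would cause the calling thread to be suspended
# for a given grace interval. A grace interval of 0 models suspending
# the thread around every syscall.
def handoffs(durations_usec, grace_usec)
  durations_usec.count { |duration| duration >= grace_usec }
end

# Hypothetical workload: many very short syscalls and a few slow ones.
durations = [2] * 1_000 + [5_000] * 3

puts "immediate suspend: #{handoffs(durations, 0)} handoffs"
puts "deferred #{DEFER_USEC}usec: #{handoffs(durations, DEFER_USEC)} handoffs"
```

In this model only the three slow calls outlast the grace interval, so short syscalls stop paying for a thread handoff while long ones still let other threads run.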
I pasted raw benchmark results below from a single run (though I did many runs and the results are stable). My CPU is an Apple M4.
After the fix:
- Single-core runtime drops by about 44%, from 3.6sec to 2sec.
- Adding cores leaves performance flat (at 2sec), rather than making it 10x slower.
- Multi-core context-switch count drops by 99.995%, from 1.4 million to ~80.
- system_time/user_time ratio drops from (1.2 - 1.6) to 0.65
Here are the benchmark results before my change:
# time taskset --cpu-list 1 ./ruby qtest_simple.rb
voluntary_ctxt_switches: 1140773
nonvoluntary_ctxt_switches: 9487
real 0m3.619s
user 0m1.653s
sys 0m1.950s
# time taskset --cpu-list 1,2 ./ruby qtest_simple.rb
voluntary_ctxt_switches: 1400110
nonvoluntary_ctxt_switches: 3
real 0m36.223s
user 0m9.380s
sys 0m14.927s
And after:
# time taskset --cpu-list 1 ./ruby qtest_simple.rb
voluntary_ctxt_switches: 88
nonvoluntary_ctxt_switches: 899
real 0m2.031s
user 0m1.209s
sys 0m0.743s
# time taskset --cpu-list 1,2 ./ruby qtest_simple.rb
voluntary_ctxt_switches: 75
nonvoluntary_ctxt_switches: 8
real 0m2.062s
user 0m1.279s
sys 0m0.783s
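The ratios and slowdown factors for this single run can be recomputed from the timings above. The sketch below copies the `real`, `user` and `sys` values from the qtest_simple.rb output; the hash layout and the helper names `sys_user_ratio` and `multi_core_slowdown` are only for illustration.

```ruby
# Timings in seconds, copied from the qtest_simple.rb runs above.
RUNS = {
  before_1_core:  { real: 3.619,  user: 1.653, sys: 1.950 },
  before_2_cores: { real: 36.223, user: 9.380, sys: 14.927 },
  after_1_core:   { real: 2.031,  user: 1.209, sys: 0.743 },
  after_2_cores:  { real: 2.062,  user: 1.279, sys: 0.783 }
}.freeze

# Ratio of system time to user time for one run.
def sys_user_ratio(run)
  (run[:sys] / run[:user]).round(2)
end

# How many times slower the two-core run is than the single-core run.
def multi_core_slowdown(single, multi)
  (multi[:real] / single[:real]).round(2)
end

RUNS.each do |name, run|
  puts "#{name}: sys/user = #{sys_user_ratio(run)}"
end

puts "slowdown before: #{multi_core_slowdown(RUNS[:before_1_core], RUNS[:before_2_cores])}x"
puts "slowdown after:  #{multi_core_slowdown(RUNS[:after_1_core], RUNS[:after_2_cores])}x"
```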
I was concerned these results might still be reflective of a bug in Thread::Queue, so I also came up with a repro that doesn't rely on it. That one is [here].
Results summary:
- Single-core performance improves (this time by only 30%)
- Multi-core penalty drops from 4x to 0.
- No change to context-switching rates.
- system_time/user_time ratio drops from (0.5-1) to 0.15
Before fix:
# time taskset --cpu-list 1 ./ruby mbenchmark.rb
voluntary_ctxt_switches: 60
real 0m0.336s
user 0m0.211s
sys 0m0.118s
# time taskset --cpu-list 1,2 ./ruby mbenchmark.rb
voluntary_ctxt_switches: 60
real 0m1.424s
user 0m0.468s
sys 0m0.496s
After fix:
# time taskset --cpu-list 1 ./ruby mbenchmark.rb
voluntary_ctxt_switches: 59
real 0m0.241s
user 0m0.202s
sys 0m0.032s
# time taskset --cpu-list 1,2 ./ruby mbenchmark.rb
voluntary_ctxt_switches: 60
real 0m0.238s
user 0m0.195s
sys 0m0.035s
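The `voluntary_ctxt_switches` and `nonvoluntary_ctxt_switches` lines in the runs above use the same field names that Linux reports in `/proc/<pid>/status`. Assuming a Linux host, a script could print its own counters at exit roughly as follows. This is a sketch of one way to collect them and not necessarily how the linked benchmarks do it; the helper name `ctxt_switches` is hypothetical.

```ruby
# Extract the context-switch counters from the text of a /proc/<pid>/status file.
# Returns a hash such as {"voluntary_ctxt_switches" => 88, ...}.
def ctxt_switches(status_text)
  status_text.scan(/^((?:non)?voluntary_ctxt_switches):\s+(\d+)$/)
             .to_h { |name, count| [name, count.to_i] }
end

# On Linux, print the counters for the current process when it exits.
# On other systems /proc/self/status is absent and nothing is registered.
if File.readable?("/proc/self/status")
  at_exit do
    ctxt_switches(File.read("/proc/self/status")).each do |name, count|
      puts "#{name}: #{count}"
    end
  end
end
```

Parsing is kept separate from reading the file so the helper can also be applied to a status snapshot captured from another process.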