Bug #20816
openPotential regression in Ruby 3.3.x (compared with 3.1 and 3.2) regarding fast syscalls and multi-threading.
Description
Ok I'm sorry in advance if this is not really a regression (maybe it's the cost to pay for other improvements) but it still showed as a 30% performance drop in my benchmark so I think it may be worth a look.
For the full story about how I came down this rabbit hole I just wrote a blog post to avoid cluttering the issue: https://updown.io/blog/weird-results-comparing-ruby-3-1-3-2-3-3-with-jemalloc-and-yjit (please read it first)
Here I'm gonna put my last findings and the smallest reproducible example I was able to write.
So basically I noticed that quick syscalls (at least with a good SSD) like File.exists?
or File.mtime
, when used in a multi-threaded environement (Sidekiq in my initial testing but I reduced it to a few Thread.new) are significantly slower in 3.3.x (tested with the same results on 3.3.0, 3.3.2 and 3.3.5 for the moment) in comparison with 3.1.2 or 3.2.5: about twice as slow.
These syscalls are heavily used by the Rails Reloader which is how it had such a huge impact (30%) in my testing.
Here is the small reproduction script I wrote:
#!/usr/bin/env ruby
GC.disable # just to make sure it doesn't skew results
THREADS = (ARGV.first || 10).to_i # choose thread count from CLI
N = 10_000_000 # to make the test longer or shorter
puts "Ruby #{RUBY_VERSION} / YJIT: #{RubyVM::YJIT.enabled?} / #{THREADS} Threads"
Array.new(THREADS).map do
Thread.new do
(N/THREADS).times { File.mtime(__FILE__) }
end
end.each(&:join)
puts "#{N} calls completed"
The goal is just to run a lot of fast syscalls, in various number of threads, and here are the numbers on my machine:
Threads\Ruby | 3.1.2 | 3.3.5 |
---|---|---|
1 | 13.7s | ✅ 14.2s |
2 | 9.7s | ⚠️ 33.4s |
4 | 23.2s | ⚠️ 37.0s |
8 | 21.8s | ⚠️ 39.7s |
16 | 23.1s | ⚠️ 41.5s |
(more numbers in the blog post)
I suspected the new N:M thread scheduler introduced in 3.3, even though it's "disabled" by default on the main Ractor, thinking maybe these changes degraded a bit this (quite specific scenario). So I gave a quick try with the feature ENABLED, to see if it would at least change anything and it did ! (though not was I expected):
Threads\Ruby | 3.1.2 | 3.3.5 | 3.3.5 with RUBY_MN_THREADS=1 |
---|---|---|---|
1 | 13.7s | 14.2s | 14.5s ✅ |
2 | 9.7s | ⚠️ 33.4s | 12.3s 🎉 |
4 | 23.2s | ⚠️ 37.0s | 40.5s 🧐 |
8 | 21.8s | ⚠️ 39.7s | 37.9s 🧐 |
16 | 23.1s | ⚠️ 41.5s | 36.3s 🧐 |
(I can make graphs for these numbers if it helps)
Now I'm a bit stuck in my investigation, I'm not sure what change in 3.3 caused this (the new thread scheduler is a good lead IMO) or if it even is a regression. Maybe this downside is innevitable for other benefits? I am happy to contribute more time to this issue and perform more tests but I would need some guidance on what could be interesting to test, from people who understand a bit better than me how this works internally.
Thanks ! (and sorry for the long read)
Files
Updated by nobu (Nobuyoshi Nakada) 3 months ago
The result on my machine (Apple M2 Pro, 12 processors).
Note that N=5_000_000, a half of your example.
Threads\Ruby | 3.1.4 | 3.2.5 | 3.3.5 | 3.4.0 |
---|---|---|---|---|
1 | 8.818 | 5.628 | 5.943 | 6.066 |
2 | 10.389 | 6.920 | 11.252 | 12.315 |
4 | 18.643 | 15.347 | 20.285 | 20.184 |
8 | 18.766 | 14.994 | 20.155 | 19.923 |
16 | 17.530 | 13.459 | 19.988 | 19.705 |
Updated by adrienjarthon (Adrien Jarthon) 3 months ago
Thanks, these numbers are slightly more "expected" than mine (Ubuntu 24.04), especially for 2 threads, but other that that it looks like the "regression" is also visible on MacOS.
For the record I also tested with ruby master (3.4.0 as of 2024-10-25) but the results were similar to 3.3.5 (like for you).
Updated by byroot (Jean Boussier) 3 months ago
- File Capture d’écran 2024-10-26 à 12.10.55.png Capture d’écran 2024-10-26 à 12.10.55.png added
- File Capture d’écran 2024-10-26 à 12.10.45.png Capture d’écran 2024-10-26 à 12.10.45.png added
- File Capture d’écran 2024-10-26 à 12.14.51.png Capture d’écran 2024-10-26 à 12.14.51.png added
- File Capture d’écran 2024-10-26 à 12.14.44.png Capture d’écran 2024-10-26 à 12.14.44.png added
- Backport changed from 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN to 3.1: DONTNEED, 3.2: DONTNEED, 3.3: REQUIRED
I can also repro on macOS. Comparing profiles between 3.2 and 3.3, in both case the overwhelming majority of the time is spent in the thread scheduler.
The one big difference I can see is that 3.2 seem to use poll
to wait on threads, while 3.3 uses pthread_cond_wait
.
Updated by luke-gru (Luke Gruber) 14 days ago
I think this is due to the refactor for M:N threads. In the 3.2 release, which is fast even though it still uses condition variables in thread_sched_to_running_common
, the changing of threads was never forced during a blocking operation.
That is, in BLOCKING_REGION (thread.c
), it goes:
-
thread_sched_to_waiting(th)
-- this releases per-ractor GVL (TH_SCHED(th)->running
changes from current ruby thread (GET_THREAD()
) to NULL). Then, signal the next ruby thread in this ractor's readyq, if any. - execute blocking operation, like
File.mtime
-
thread_sched_to_running(th)
-- Blocks until another thread signals it if another thread is running and has the GVL, otherwise continues without blocking
Lots of times, since #2 can be so fast, when we get to #3, sched->running
is still NULL from #1 because the other thread we signalled hasn't woken up yet and taken control of sched->running
. The OS scheduler hasn't switched, so there are less context switches. I see this in some custom printf logging I made, where 1 thread runs many iterations of the syscall, then another thread runs many, etc.
In the new system,
- Similar to before, but if there's a next thread ready (in readyq) we actually force it to hold the GVL (we set
sched->running = next_th
), then we signal it's condvar telling it to run - Execute blocking operation
- Put ourselves in the back of the readyq, and wait until we're signalled
So in summary, there is more context switching, which is the main slowdown I believe. There is also more overhead now due to timeslice bookkeeping which I think can be reduced.
The new ractor/thread scheduling system is more complex than the old one, but I think we can fix this issue. I read that Go does some fancy things with fast syscalls where if a known-fast syscall is made, the goroutine scheduler doesn't signal the next goroutine but posts to a monitoring thread with a small timeout value and if this value is exceeded by the syscall then the next goroutine is scheduled, otherwise it never signals the next goroutine at all, takes itself off the monitoring thread and continues. There's some information here about that: https://utcc.utoronto.ca/~cks/space/blog/programming/GoSchedulerAndSyscalls.
We could do something similar to Go or try to incorporate more of how it used to work into the new system. I think doing something similar to Go makes sense with how the new scheduler works.
Updated by byroot (Jean Boussier) 13 days ago
- Assignee set to ko1 (Koichi Sasada)
@ko1 (Koichi Sasada) would input would be welcome on this issue.