Bug #20197
closedPostponed job invocations are significantly reduced in Ruby 3.3
Description
The number of postponed job invocations has been significantly reduced in Ruby 3.3.
While my understanding is that postponed jobs provide no guarantee of how soon registered callbacks will fire, I believe the current rate is too low for practical usage, especially for profilers such as StackProf.
A git bisect led me to https://github.com/ruby/ruby/commit/1f0304218cf00e05a4a126196676ba221ebf91f6 which obviously seems to be related, but I'm not sure why.
Repro¶
Expected¶
The job fires (nearly) immediately after being triggered.
In the following example, the job is triggered every 100 ms.
% ruby bin/test.rb # runs for 3 seconds
count: 1
count: 2
(snip)
count: 29
Actual¶
The job gets fired only once.
% ruby bin/test.rb
count: 1
Code¶
require 'mycext'
time = Time.now
th = Thread.new do
loop do
sleep 0.01
break if Time.now - time > 3 # run for 3 seconds
end
end
th.join
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include "ruby.h"
#include "ruby/debug.h"
int called_count;
void
postponed_job(void *ptr)
{
called_count++;
printf("count: %d\n", called_count);
}
_Noreturn void *
pthread_main(void *_)
{
while (1) {
rb_postponed_job_register_one(0, postponed_job, NULL);
// Sleep for 100 ms
struct timespec ts;
ts.tv_sec = 0;
ts.tv_nsec = 100 * 1000 * 1000;
nanosleep(&ts, NULL);
}
}
RUBY_FUNC_EXPORTED void
Init_mycext(void)
{
called_count = 0;
pthread_t pthread;
pthread_create(&pthread, NULL, pthread_main, NULL);
}
Updated by kjtsanaktsidis (KJ Tsanaktsidis) 10 months ago
- Assignee set to kjtsanaktsidis (KJ Tsanaktsidis)
I hacked at the postponed job infrastructure pretty extensively in 3.3, I’ll have a look at this bug in the next few days.
Thank you for your report!
Updated by byroot (Jean Boussier) 10 months ago
especially for profilers such as StackProf.
Not that it makes this less of a bug, but note that recent versions of StackProf (and more mordern profilers like Vernier) avoid relying on postponed jobs, because no matter how frequent they may be triggered, that always give biased results.
Updated by kjtsanaktsidis (KJ Tsanaktsidis) 10 months ago
OK, I investigated this a bit. The reason your example only runs the callback once is that the "is-an-interrupt-pending?" flag is a per-thread (technically per execution context) state. Normally, the rb_postponed_job_register_one
(and the new rb_postponed_job_trigger
function as well) will set the interrupt-pending flag on the currently thread, which is what you want when running in a signal handler. However, when you're running from a non-ruby thread, you need to pick a ruby thread to set the flag on.
The previous implementation before https://github.com/ruby/ruby/commit/1f0304218cf00e05a4a126196676ba221ebf91f6 would pick vm->ractor.main_ractor->threads.running_ec
, which is actually the previous thread to run if there is no currently running thread. Now, however, we always pick the main thread (this seems to be because the semantics of running_ec
are different when using the new M:N threading model).
Your example has the main thread sleeping, so it will never check the interrupt-pending flag. Thus, now it doesn't run the callbacks (until the other thread is done).
This PR will I think fix the issue with your reproduction: https://github.com/ruby/ruby/pull/9633.
Are you able to share some more details about your use-case for postponed_job by the way? I ask, because even with this fix (and restoring the Ruby 3.2 behaviour), there are still some race conditions possible (e.g. running_ec
might have just entered a long sleep). If we want to make it possible to robustly run postponed jobs "soon" from non-Ruby threads, I think we would need:
- Add a VM-global interrupt flag
- Have
rb_thread_check_ints
check the VM-global flag as well as its own EC-local flag, - Have
rb_postponed_job_trigger
see if we're running on a Ruby thread, and if not, set the VM-global interrupt flag instead of the EC-local one.
This seems like a fair bit of extra complexity to carry around though if there isn't a use-case which requires it.
Updated by osyoyu (Daisuke Aritomo) 10 months ago
Thank you for your investigation and thorough explanation! I now understand what was happening.
This PR will I think fix the issue with your reproduction: https://github.com/ruby/ruby/pull/9633.
Applying this patch worked perfectly on my environment too.
Are you able to share some more details about your use-case for postponed_job by the way?
I was actually working on my own Ruby profiler https://github.com/osyoyu/pf2 , and was using postponed_job to run rb_profile_thread_frames()
, just like StackProf's original implementation.
The rb_postponed_job_trigger
call could be found in TimerThreadScheduler
.
https://github.com/osyoyu/pf2/blob/v0.2.0/ext/pf2/src/timer_thread_scheduler.rs#L82
(As @byroot (Jean Boussier) pointed out, I'm aware that postponed_job isn't strictly required here; I could send a signal to the target pthread, and call rb_profile_thread_frames()
in the signal handler. But that's probably an another story)
This seems like a fair bit of extra complexity to carry around though if there isn't a use-case which requires it.
From the perspective of a sampling profiler developer, I think it's safe not to implement this technique. I'm trigger
ing postponed jobs periodically (e.g. every 10 ms or so), and one trigger flag being effectively "ignored" (due to be getting registered in a thread which just entered a long sleep) is negligible, as long as subsequent triggers get fired.
Updated by kjtsanaktsidis (KJ Tsanaktsidis) 10 months ago
- Status changed from Open to Closed
Applied in changeset git|ef276858d9295208add48e27208c69184dc50472.
Trigger postponed jobs on running_ec if that is available
Currently, any postponed job triggered from a non-ruby thread gets sent
to the main thread, but if the main thread is sleeping it won't be
checking ints. Instead, we should try and interrupt running_ec if that's
possible, and only fall back to the main thread if it's not.
[Bug #20197]
Updated by byroot (Jean Boussier) 10 months ago
- Backport changed from 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN to 3.0: DONTNEED, 3.1: DONTNEED, 3.2: DONTNEED, 3.3: REQUIRED
@kjtsanaktsidis (KJ Tsanaktsidis) I'm assuming this should be backported to 3.3, but if I'm wrong you can change it to DONTNEED
Updated by kjtsanaktsidis (KJ Tsanaktsidis) 10 months ago
Ah, thank you - yes, this should be backported to 3.3 I think.
Updated by naruse (Yui NARUSE) 8 months ago
- Backport changed from 3.0: DONTNEED, 3.1: DONTNEED, 3.2: DONTNEED, 3.3: REQUIRED to 3.0: DONTNEED, 3.1: DONTNEED, 3.2: DONTNEED, 3.3: DONE
ruby_3_3 0793cbbfde261f4fc9bf7045594d62a21e391811 merged revision(s) ef276858d9295208add48e27208c69184dc50472.