Bug #19991 (Closed)
rb_register_postponed_job async-signal-unsafety causes crash in GC
Description
Our production application experienced an interpreter crash today that I’m fairly sure is the result of a bug in the rb_register_postponed_job infrastructure.
Diagnosis
I’ve attached a more complete version of the crash dump, but I think the key information is the following backtrace:
<internal:gc>:35: [BUG] Segmentation fault at 0x00000000000000c8
....
/usr/lib/libruby.so.3.1(rb_vm_bugreport+0x60c) [0xffffaaf57a6c] vm_dump.c:759
/usr/lib/libruby.so.3.1(rb_bug_for_fatal_signal+0xd8) [0xffffaad6d120] error.c:821
/usr/lib/libruby.so.3.1(sigsegv+0x58) [0xffffaaeb4350] signal.c:964
/usr/lib/libruby.so.3.1(sigill) (null):0
linux-vdso.so.1(__kernel_rt_sigreturn+0x0) [0xffffab20f78c]
/usr/lib/libruby.so.3.1(rbimpl_atomic_exchange+0x0) [0xffffaad94018] gc.c:4081
/usr/lib/libruby.so.3.1(gc_finalize_deferred) gc.c:4081
/usr/lib/libruby.so.3.1(rb_postponed_job_flush+0x218) [0xffffaaf5dd60] vm_trace.c:1728
/usr/lib/libruby.so.3.1(rb_threadptr_execute_interrupts+0x328) [0xffffaaef7520] thread.c:2444
/usr/lib/libruby.so.3.1(vm_exec_core+0x4e54) [0xffffaaf41c6c] vm.inc:588
/usr/lib/libruby.so.3.1(rb_vm_exec+0x144) [0xffffaaf427cc] vm.c:2211
I disassembled gc_finalize_deferred in GDB (different execution, so the addresses don’t line up - this is NOT from a core dump):
(gdb) disassemble gc_finalize_deferred
Dump of assembler code for function gc_finalize_deferred:
0x0000fffff7cc3ff8 <+0>: stp x29, x30, [sp, #-48]!
0x0000fffff7cc3ffc <+4>: mov w1, #0x1 // #1
0x0000fffff7cc4000 <+8>: mov x29, sp
0x0000fffff7cc4004 <+12>: stp x19, x20, [sp, #16]
0x0000fffff7cc4008 <+16>: mov x19, x0
0x0000fffff7cc400c <+20>: add x20, x0, #0xc8
0x0000fffff7cc4010 <+24>: ldaxr w0, [x20]
.... continues ....
Based on the line numbers from the debuginfo, the faulting instruction is likely to be gc_finalize_deferred+24 (which is the inlined rbimpl_atomic_exchange). That would attempt a load from address 0xc8 if x0 was zero - i.e. if gc_finalize_deferred was called with a NULL objspace.
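To illustrate why the fault address matches the field offset, here is a minimal standalone sketch (the struct and field names are stand-ins for illustration, not the real rb_objspace layout):

/* Stand-in for the real objspace struct: some field happens to live at
 * offset 0xc8, matching the "add x20, x0, #0xc8" in the disassembly. */
struct fake_objspace {
    char padding[0xc8];
    int finalizing;
};

void fake_finalize_deferred(struct fake_objspace *objspace)
{
    /* The inlined atomic exchange computes &objspace->finalizing, i.e.
     * (char *)objspace + 0xc8. If objspace is NULL, the ldaxr loads from
     * address 0xc8 - exactly the faulting address in the crash dump. */
    __atomic_exchange_n(&objspace->finalizing, 1, __ATOMIC_SEQ_CST);
}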
The enqueuing of gc_finalize_deferred with postponed job only happens in one place (in gc_sweep_page), and if objspace was null there then the crash would have had to have already happened in gc_sweep_page. Thus, I think that gc_finalize_deferred was enqueued into the postponed job list with a not-NULL argument, but the argument was corrupted whilst it was in vm->postponed_job_buffer.
I had a look at the postponed job code, which is of course very tricky because it needs to be async-signal-safe. More specifically:
- It needs to work if run from a thread not holding the GVL
- It needs to work if run from a thread whilst another thread is actually executing rb_postponed_job_flush
- It needs to work if run from a signal caught in a thread that is currently executing rb_postponed_job_flush (this rules out trivial mutex-based solutions - see the sketch below)
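As a concrete illustration of that last point, here is a minimal sketch (hypothetical names, not the vm_trace.c code) of how a plain pthread mutex self-deadlocks when the registering signal lands on the thread that is flushing:

#include <pthread.h>

static pthread_mutex_t job_lock = PTHREAD_MUTEX_INITIALIZER;

/* Runs on thread T1 with the lock held. */
void flush_jobs(void)
{
    pthread_mutex_lock(&job_lock);
    /* ... a profiling signal is delivered to T1 right here ... */
    pthread_mutex_unlock(&job_lock);
}

/* Signal handler, also running on T1, on top of flush_jobs. */
void register_job_from_signal(int sig)
{
    (void)sig;
    /* T1 already owns job_lock, so this never returns - and calling
     * pthread_mutex_lock from a signal handler is not async-signal-safe
     * in the first place. */
    pthread_mutex_lock(&job_lock);
    /* ... append the job ... */
    pthread_mutex_unlock(&job_lock);
}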
We use the Datadog continuous profiler in our application (CC: @ivoanjo (Ivo Anjo) ;), which calls rb_postponed_job_register to capture profiling samples. Thus, I think our application is likely to hit all of those scenarios semi-regularly.
My best guess at a plausible sequence of events to generate this particular crash is the following (a sketch of the race follows the list):
1. rb_postponed_job_flush was running on thread T1.
2. There is a queued call to gc_finalize_deferred sitting in vm->postponed_job_buffer[vm->postponed_job_index-1].
3. T1 executed the ATOMIC_CAS at vm_trace.c:1800, decrementing vm->postponed_job_index (which now equals index - 1) and determining that a job at index index should be run.
4. Thread T2 received a signal, and the Datadog continuous profiler called rb_postponed_job_register_one(0, sample_from_postponed_job, NULL).
5. T2 executed the ATOMIC_CAS at vm_trace.c:1679, re-incrementing vm->postponed_job_index. It’s now equal to index from T1 again.
6. T2 then executes the sets on pjob->func and pjob->data at vm_trace.c:1687. It sets ->func to sample_from_postponed_job (from ddtrace), and ->data to 0.
7. T1 then goes to call (*pjob->func)(pjob->data) at vm_trace.c:1802.
8. Since there is no memory barrier between 6 & 7, T1 is allowed to see the effects of the set on pjob->data and not see that of pjob->func.
9. T1 thus calls gc_finalize_deferred (which it was meant to do) with an argument of 0 (which it was not).
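To make the interleaving concrete, here is a deliberately simplified model of the buffer and the two code paths (the names and layout are approximations of vm_trace.c, not the actual code):

#include <stdatomic.h>

typedef void (*job_func_t)(void *);
typedef struct { job_func_t func; void *data; } postponed_job_t;

#define MAX_JOBS 1000
static postponed_job_t buffer[MAX_JOBS];
static atomic_int job_index;

/* T2, in the signal handler (register path): wins slot `index` back with a
 * CAS, then fills it in with two plain, unordered stores (steps 5 and 6). */
void register_one(job_func_t func, void *data)
{
    int index = atomic_load(&job_index);
    int expected = index;
    /* Simplified: assume the CAS succeeds. */
    atomic_compare_exchange_strong(&job_index, &expected, index + 1);
    buffer[index].func = func;
    buffer[index].data = data;
}

/* T1, in the flush path: it already decremented job_index with a CAS
 * (step 3) and now calls the job (step 7). Nothing orders T2's plain
 * stores against these plain loads, so T1 may observe the new ->data
 * (NULL) paired with the old ->func (gc_finalize_deferred) - steps 8
 * and 9. */
void run_job_at(int index)
{
    postponed_job_t *pjob = &buffer[index];
    (*pjob->func)(pjob->data);
}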
Solution
Managing a thread-safe list of too-big-to-be-atomic objects (like rb_postponed_job_t) is really tricky. I think it might be best for this code to use a combination of masking signals (to prevent manipulation of the postponed job list occurring during rb_postponed_job_flush) and a semaphore to protect the list (since semaphores are required to be async-signal-safe on POSIX systems). I've implemented this in a PR here: https://github.com/ruby/ruby/pull/8856
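To sketch the shape of that approach (this is only my illustration of the idea described above, not the code in the PR; it relies on the premise above that semaphore operations are usable from signal handlers):

#include <pthread.h>
#include <semaphore.h>
#include <signal.h>

/* Initialised elsewhere with sem_init(&job_list_sem, 0, 1). */
static sem_t job_list_sem;

/* Register path - may run inside a signal handler on any thread. */
void job_register(void (*func)(void *), void *data)
{
    (void)func; (void)data;
    sem_wait(&job_list_sem);
    /* ... append func/data to the postponed job list ... */
    sem_post(&job_list_sem);   /* sem_post is async-signal-safe per POSIX */
}

/* Flush path - runs on the thread executing interrupts. */
void job_flush(void)
{
    sigset_t all, saved;
    sigfillset(&all);
    /* Block signals so no handler on this thread can touch the list
     * while it is being drained. */
    pthread_sigmask(SIG_SETMASK, &all, &saved);
    sem_wait(&job_list_sem);   /* exclude registrations from other threads */
    /* ... drain the list and run each job ... */
    sem_post(&job_list_sem);
    pthread_sigmask(SIG_SETMASK, &saved, NULL);
}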
It seems slightly slower to do it this way - semaphores require system calls even in the uncontended case (which is why they're async-signal-safe), making them more expensive than pthread mutexes, which avoid those syscalls on most systems. I ran my branch through yjit-bench:
With my patch:
interp: ruby 3.3.0dev (2023-11-07T08:14:09Z ktsanaktsidis/post.. 342f30f566) [arm64-darwin22]
yjit: ruby 3.3.0dev (2023-11-07T08:14:09Z ktsanaktsidis/post.. 342f30f566) +YJIT [arm64-darwin22]
-------------- ----------- ---------- --------- ---------- ------------ -----------
bench interp (ms) stddev (%) yjit (ms) stddev (%) yjit 1st itr interp/yjit
activerecord 31.2 3.4 17.0 5.7 1.29 1.83
chunky-png 543.5 0.5 367.0 0.7 1.40 1.48
erubi-rails 1044.2 0.6 564.7 1.3 1.69 1.85
hexapdf 1517.6 3.1 917.6 1.2 1.46 1.65
liquid-c 37.1 1.3 28.9 1.4 0.89 1.29
liquid-compile 39.0 1.4 29.9 1.6 0.76 1.30
liquid-render 89.9 1.8 39.6 1.4 1.37 2.27
lobsters 598.2 1.7 435.4 5.2 0.63 1.37
mail 79.8 3.1 52.5 1.0 0.79 1.52
psych-load 1441.5 1.7 885.4 0.5 1.60 1.63
railsbench 1010.8 1.0 609.3 1.3 1.24 1.66
ruby-lsp 40.9 3.4 29.2 30.0 0.66 1.40
sequel 39.8 1.8 33.0 2.4 1.18 1.21
-------------- ----------- ---------- --------- ---------- ------------ -----------
Without the patch:
interp: ruby 3.3.0dev (2023-11-07T07:56:43Z master 5a2779d40f) [arm64-darwin22]
yjit: ruby 3.3.0dev (2023-11-07T07:56:43Z master 5a2779d40f) +YJIT [arm64-darwin22]
-------------- ----------- ---------- --------- ---------- ------------ -----------
bench interp (ms) stddev (%) yjit (ms) stddev (%) yjit 1st itr interp/yjit
activerecord 31.3 3.3 16.7 5.5 1.36 1.88
chunky-png 521.6 0.6 348.8 0.7 1.40 1.50
erubi-rails 1038.9 0.9 566.3 1.2 1.70 1.83
hexapdf 1501.9 1.1 951.7 3.9 1.42 1.58
liquid-c 36.7 1.2 29.3 1.7 0.86 1.25
liquid-compile 38.8 1.1 29.7 3.7 0.73 1.31
liquid-render 92.2 0.9 38.3 1.0 1.47 2.40
lobsters 582.5 2.0 429.8 5.6 0.59 1.36
mail 77.9 1.3 54.8 0.9 0.76 1.42
psych-load 1419.1 0.7 887.7 0.5 1.60 1.60
railsbench 1017.8 1.1 609.9 1.2 1.24 1.67
ruby-lsp 41.0 2.2 28.8 28.8 0.64 1.43
sequel 36.0 1.5 30.4 1.8 1.11 1.18
-------------- ----------- ---------- --------- ---------- ------------ -----------
Maybe this is within the noise floor, but I thought I should bring it up.
Updated by ivoanjo (Ivo Anjo) about 1 year ago
Thanks for sharing this one 😅.
The Datadog profiler tries to mimic what stackprof does for interruption (signal + postponed_job_register), but indeed the race you're describing seems feasible, meaning rb_postponed_job_register is not actually signal-safe as intended.
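For context, the interruption pattern being referred to looks roughly like this (a simplified sketch of the signal + postponed-job idiom, not stackprof's or ddtrace's actual code):

#include <ruby/ruby.h>
#include <ruby/debug.h>
#include <signal.h>
#include <string.h>

/* Runs later on a Ruby thread that holds the GVL, where it is safe to
 * walk the Ruby stack and allocate objects. */
static void sample_from_postponed_job(void *data)
{
    (void)data;
    /* ... take the profiling sample ... */
}

/* SIGPROF handler: only async-signal-safe work is allowed here, so it
 * just queues the real sampling work. This is the enqueue that can race
 * with rb_postponed_job_flush as described in the report above. */
static void sigprof_handler(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig; (void)info; (void)ucontext;
    rb_postponed_job_register_one(0, sample_from_postponed_job, NULL);
}

static void install_profiler_signal_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = sigprof_handler;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigaction(SIGPROF, &sa, NULL);
}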
Updated by ivoanjo (Ivo Anjo) 10 months ago
I guess this can be closed since https://github.com/ruby/ruby/pull/8949 was merged? ;)
Updated by jeremyevans0 (Jeremy Evans) 10 months ago
- Status changed from Open to Closed
Updated by ivoanjo (Ivo Anjo) 9 months ago
I've also documented how to work around this issue in the Datadog Ruby profiler in https://docs.datadoghq.com/profiler/profiler_troubleshooting/ruby/#segmentation-faults-in-gc_finalize_deferred-in-ruby-versions-26-to-32 although we're still considering putting a workaround in the profiler itself.