Bug #17573
closedCrashes in profiling tools when signals arrive in non-Ruby threads
Description
Stackprof (and likely similar tools) works by setting up a timer to sends it a unix signal on an interval. From that signal handler it does a small amount of internal bookkeeping and calls rb_postponed_job_register_one
.
This is a problem because unix signals arrive on an arbitrary thread, and as of Ruby 3.0 the execution context (which rb_postponed_job_register_one
relies on) is stored as a thread-local.
This reproduction crashes reliably for me on macos. It doesn't seem to on linux, maybe because the timer thread is different or the kernel has a different "arbitrary" choice. It feels like this is just one of the circumstances this crash could happen.
require "stackprof"
StackProf.run(interval: 100) do
1000.times do
GC.start
end
end
$ ruby crash_stackprof.rb
[BUG] Segmentation fault at 0x0000000000000038
ruby 3.0.0p0 (2020-12-25 revision 95aff21468) [x86_64-darwin19]
-- Crash Report log information --------------------------------------------
See Crash Report log file under the one of following:
* ~/Library/Logs/DiagnosticReports
* /Library/Logs/DiagnosticReports
for more details.
Don't forget to include the above Crash Report log file in bug reports.
-- Machine register context ------------------------------------------------
rax: 0x0000000000000000 rbx: 0x0000000107fbb780 rcx: 0x0000000000000000
rdx: 0x0000000000000000 rdi: 0x0000000106982c28 rsi: 0x0000000107fbb780
rbp: 0x000070000eb47a10 rsp: 0x000070000eb479f0 r8: 0x000070000eb47eb0
r9: 0xd44931e7344c235f r10: 0x00007fff6ef49501 r11: 0x0000000000000202
r12: 0xd44931e7344c235f r13: 0x00000000ffffffff r14: 0x0000000000000000
r15: 0x0000000000000000 rip: 0x00000001068c85fd rfl: 0x0000000000010202
-- C level backtrace information -------------------------------------------
/Users/jhawthorn/.rubies/ruby-3.0.0/bin/ruby(rb_vm_bugreport+0x6cf) [0x1068c2d5f]
/Users/jhawthorn/.rubies/ruby-3.0.0/bin/ruby(rb_bug_for_fatal_signal+0x1d6) [0x1066dc556]
/Users/jhawthorn/.rubies/ruby-3.0.0/bin/ruby(sigsegv+0x5b) [0x10681aa0b]
/usr/lib/system/libsystem_platform.dylib(_sigtramp+0x1d) [0x7fff6efff5fd]
/Users/jhawthorn/.rubies/ruby-3.0.0/bin/ruby(rb_postponed_job_register_one+0x1d) [0x1068c85fd]
/usr/lib/system/libsystem_platform.dylib(0x7fff6efff5fd) [0x7fff6efff5fd]
0x38
is the address of ((rb_execution_context_t *)0)->vm
.
lldb shows that it comes from a second thread which was running timer_pthread_fn
$ lldb =ruby -- ./crash_stackprof.rb
(lldb) target create "/Users/jhawthorn/.rubies/ruby-3.0.0/bin/ruby" ruCurrent executable set to '/Users/jhawthorn/.rubies/ruby-3.0.0/bin/ruby' (x86_64).
(lldb) settings set -- target.run-args "./crash_stackprof.rb" (lldb) run
Process 92893 launched: '/Users/jhawthorn/.rubies/ruby-3.0.0/bin/ruby' (x86_64) Process 92893 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGALRM
frame #0: 0x00000001000dfbcd ruby`rgengc_check_relation [inlined] RVALUE_OLD_P_RAW(obj=4303689480) at gc.c:1419:32
1416 RVALUE_OLD_P_RAW(VALUE obj)
1417 {
1418 const VALUE promoted = FL_PROMOTED0 | FL_PROMOTED1;
-> 1419 return (RBASIC(obj)->flags & promoted) == promoted;
1420 }
1421
1422 static inline int thread #2, stop reason = EXC_BAD_ACCESS (code=1, address=0x38)
frame #0: 0x000000010029a5fd ruby`rb_postponed_job_register_one(flags=3492904, func=(stackprof.bundle`stackprof_gc_job_handler at stackprof.c:598), data=0x0000000000000000) at vm_trace.c:1622:19 1619 rb_postponed_job_register_one(unsigned int flags, rb_postponed_job_func_t func, void *data)
1620 { 1621 rb_execution_context_t *ec = GET_EC();
-> 1622 rb_vm_t *vm = rb_ec_vm_ptr(ec);
1623 rb_postponed_job_t *pjob;
1624 rb_atomic_t i, index;
1625
Target 0: (ruby) stopped.
(lldb) t 2
* thread #2
frame #0: 0x000000010029a5fd ruby`rb_postponed_job_register_one(flags=3492904, func=(stackprof.bundle`stackprof_gc_job_handler at stackprof.c:598), data=0x0000000000000000) at vm_trace.c:1622:19
1619 rb_postponed_job_register_one(unsigned int flags, rb_postponed_job_func_t func, void *data)
1620 {
1621 rb_execution_context_t *ec = GET_EC();
-> 1622 rb_vm_t *vm = rb_ec_vm_ptr(ec);
1623 rb_postponed_job_t *pjob;
1624 rb_atomic_t i, index;
1625
(lldb) bt
* thread #2
* frame #0: 0x000000010029a5fd ruby`rb_postponed_job_register_one(flags=3492904, func=(stackprof.bundle`stackprof_gc_job_handler at stackprof.c:598), data=0x0000000000000000) at vm_trace.c:1622:19
frame #1: 0x00007fff6efff5fd libsystem_platform.dylib`_sigtramp + 29
frame #2: 0x00007fff6ef4e3d7 libsystem_kernel.dylib`poll + 11
frame #3: 0x0000000100238e1e ruby`timer_pthread_fn(p=<unavailable>) at thread_pthread.c:2189:15
frame #4: 0x00007fff6f00b109 libsystem_pthread.dylib`_pthread_start + 148
frame #5: 0x00007fff6f006b8b libsystem_pthread.dylib`thread_start + 15
Attached is my attempted fix (also available at https://github.com/ruby/ruby/pull/4108) which uses the main-ractor's EC if there is none on the current thread. I hope this works (it seems to and fixes the crash) because before Ruby 3.0 there was a global EC, but I'm not entirely sure if this will cause other problems.
If accepted this should be backported to the 3.0 branch.
Files