Project

General

Profile

Bug #17573

Crashes in profiling tools when signals arrive in non-Ruby threads

Added by jhawthorn (John Hawthorn) about 1 month ago. Updated 3 days ago.

Status:
Open
Priority:
Normal
Target version:
-
ruby -v:
ruby 3.0.0p0 (2020-12-25 revision 95aff21468) [x86_64-darwin19]
[ruby-core:102206]

Description

Stackprof (and likely similar tools) works by setting up a timer to sends it a unix signal on an interval. From that signal handler it does a small amount of internal bookkeeping and calls rb_postponed_job_register_one.

This is a problem because unix signals arrive on an arbitrary thread, and as of Ruby 3.0 the execution context (which rb_postponed_job_register_one relies on) is stored as a thread-local.

This reproduction crashes reliably for me on macos. It doesn't seem to on linux, maybe because the timer thread is different or the kernel has a different "arbitrary" choice. It feels like this is just one of the circumstances this crash could happen.

require "stackprof"

StackProf.run(interval: 100) do
  1000.times do
    GC.start
  end
end
$ ruby crash_stackprof.rb
[BUG] Segmentation fault at 0x0000000000000038
ruby 3.0.0p0 (2020-12-25 revision 95aff21468) [x86_64-darwin19]

-- Crash Report log information --------------------------------------------
   See Crash Report log file under the one of following:
     * ~/Library/Logs/DiagnosticReports
     * /Library/Logs/DiagnosticReports
   for more details.
Don't forget to include the above Crash Report log file in bug reports.

-- Machine register context ------------------------------------------------
 rax: 0x0000000000000000 rbx: 0x0000000107fbb780 rcx: 0x0000000000000000
 rdx: 0x0000000000000000 rdi: 0x0000000106982c28 rsi: 0x0000000107fbb780
 rbp: 0x000070000eb47a10 rsp: 0x000070000eb479f0  r8: 0x000070000eb47eb0
  r9: 0xd44931e7344c235f r10: 0x00007fff6ef49501 r11: 0x0000000000000202
 r12: 0xd44931e7344c235f r13: 0x00000000ffffffff r14: 0x0000000000000000
 r15: 0x0000000000000000 rip: 0x00000001068c85fd rfl: 0x0000000000010202

-- C level backtrace information -------------------------------------------
/Users/jhawthorn/.rubies/ruby-3.0.0/bin/ruby(rb_vm_bugreport+0x6cf) [0x1068c2d5f]
/Users/jhawthorn/.rubies/ruby-3.0.0/bin/ruby(rb_bug_for_fatal_signal+0x1d6) [0x1066dc556]
/Users/jhawthorn/.rubies/ruby-3.0.0/bin/ruby(sigsegv+0x5b) [0x10681aa0b]
/usr/lib/system/libsystem_platform.dylib(_sigtramp+0x1d) [0x7fff6efff5fd]
/Users/jhawthorn/.rubies/ruby-3.0.0/bin/ruby(rb_postponed_job_register_one+0x1d) [0x1068c85fd]
/usr/lib/system/libsystem_platform.dylib(0x7fff6efff5fd) [0x7fff6efff5fd]

0x38 is the address of ((rb_execution_context_t *)0)->vm.

lldb shows that it comes from a second thread which was running timer_pthread_fn

$ lldb =ruby -- ./crash_stackprof.rb
(lldb) target create "/Users/jhawthorn/.rubies/ruby-3.0.0/bin/ruby"                                                                                                                                                                     ruCurrent executable set to '/Users/jhawthorn/.rubies/ruby-3.0.0/bin/ruby' (x86_64).
(lldb) settings set -- target.run-args  "./crash_stackprof.rb"                                                                                                                                                                          (lldb) run
Process 92893 launched: '/Users/jhawthorn/.rubies/ruby-3.0.0/bin/ruby' (x86_64)                                                                                                                                                         Process 92893 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGALRM
    frame #0: 0x00000001000dfbcd ruby`rgengc_check_relation [inlined] RVALUE_OLD_P_RAW(obj=4303689480) at gc.c:1419:32
   1416 RVALUE_OLD_P_RAW(VALUE obj)
   1417 {
   1418     const VALUE promoted = FL_PROMOTED0 | FL_PROMOTED1;
-> 1419     return (RBASIC(obj)->flags & promoted) == promoted;
   1420 }
   1421
   1422 static inline int                                                                                                                                                                                                                 thread #2, stop reason = EXC_BAD_ACCESS (code=1, address=0x38)
    frame #0: 0x000000010029a5fd ruby`rb_postponed_job_register_one(flags=3492904, func=(stackprof.bundle`stackprof_gc_job_handler at stackprof.c:598), data=0x0000000000000000) at vm_trace.c:1622:19                                     1619 rb_postponed_job_register_one(unsigned int flags, rb_postponed_job_func_t func, void *data)
   1620 {                                                                                                                                                                                                                                  1621     rb_execution_context_t *ec = GET_EC();
-> 1622     rb_vm_t *vm = rb_ec_vm_ptr(ec);
   1623     rb_postponed_job_t *pjob;
   1624     rb_atomic_t i, index;
   1625
Target 0: (ruby) stopped.
(lldb) t 2
* thread #2
    frame #0: 0x000000010029a5fd ruby`rb_postponed_job_register_one(flags=3492904, func=(stackprof.bundle`stackprof_gc_job_handler at stackprof.c:598), data=0x0000000000000000) at vm_trace.c:1622:19
   1619 rb_postponed_job_register_one(unsigned int flags, rb_postponed_job_func_t func, void *data)
   1620 {
   1621     rb_execution_context_t *ec = GET_EC();
-> 1622     rb_vm_t *vm = rb_ec_vm_ptr(ec);
   1623     rb_postponed_job_t *pjob;
   1624     rb_atomic_t i, index;
   1625
(lldb) bt
* thread #2
  * frame #0: 0x000000010029a5fd ruby`rb_postponed_job_register_one(flags=3492904, func=(stackprof.bundle`stackprof_gc_job_handler at stackprof.c:598), data=0x0000000000000000) at vm_trace.c:1622:19
    frame #1: 0x00007fff6efff5fd libsystem_platform.dylib`_sigtramp + 29
    frame #2: 0x00007fff6ef4e3d7 libsystem_kernel.dylib`poll + 11
    frame #3: 0x0000000100238e1e ruby`timer_pthread_fn(p=<unavailable>) at thread_pthread.c:2189:15
    frame #4: 0x00007fff6f00b109 libsystem_pthread.dylib`_pthread_start + 148
    frame #5: 0x00007fff6f006b8b libsystem_pthread.dylib`thread_start + 15

Attached is my attempted fix (also available at https://github.com/ruby/ruby/pull/4108) which uses the main-ractor's EC if there is none on the current thread. I hope this works (it seems to and fixes the crash) because before Ruby 3.0 there was a global EC, but I'm not entirely sure if this will cause other problems.

If accepted this should be backported to the 3.0 branch.


Files

use_main_ractor_ec_on_threads_without_ec.patch (3.28 KB) use_main_ractor_ec_on_threads_without_ec.patch jhawthorn (John Hawthorn), 01/23/2021 12:07 AM
use_main_ractor_ec_on_threads_without_ec.patch (3.84 KB) use_main_ractor_ec_on_threads_without_ec.patch updated patch jhawthorn (John Hawthorn), 01/24/2021 11:10 PM

Updated by ko1 (Koichi Sasada) about 1 month ago

  • Assignee set to ko1 (Koichi Sasada)
#3

Updated by ko1 (Koichi Sasada) 17 days ago

Ah, OK. This issue doesn't expose on recent Linux system.
Hmm. How to debug it...

#4

Updated by byroot (Jean Boussier) 16 days ago

This issue doesn't expose on recent Linux system

I do have 3.0 receive SEGV avout 50% of the time when using stackprof in CPU mode on ubuntu 20.04. Unfortunately All I get as output is Segmentation fault, so I don't know wether it's the same issue or not.

Updated by byroot (Jean Boussier) 3 days ago

I don't know wether it's the same issue or not.

So I tested this patch on top of the current ruby_3_0 branch, and it does fix the stackprof issue I had.

Also available in: Atom PDF