Project

General

Profile

Actions

Bug #20243

open

M:N threading VM_ASSERT failure in rb_current_execution_context with clang 17 (on Linux)

Added by kjtsanaktsidis (KJ Tsanaktsidis) 10 months ago. Updated 10 months ago.

Status:
Open
Assignee:
-
Target version:
-
[ruby-core:116615]

Description

When building with Clang 17 and -DVM_CHECK_MODE=1 (with the following configure)

optflags="-ggdb3 -fno-omit-frame-pointer -fno-optimize-sibling-calls -O3" cflags="-DVM_CHECK_MODE=1" CC=clang  ../configure --prefix=/home/kj/ruby/installed --enable-yjit=dev --disable-install-doc

And then running the following script with the built ./miniruby (which is actually from bootstraptest/test_ractor.rb):

counts = []
counts << Ractor.count
ractors = (1..3).map { Ractor.new { Ractor.receive } }
counts << Ractor.count

ractors[0].send('End 0').take
sleep 0.1 until ractors[0].inspect =~ /terminated/
counts << Ractor.count

ractors[1].send('End 1').take
sleep 0.1 until ractors[1].inspect =~ /terminated/
counts << Ractor.count

ractors[2].send('End 2').take
sleep 0.1 until ractors[2].inspect =~ /terminated/
counts << Ractor.count

counts.inspect

I get the following crash:

Assertion Failed: ../vm_core.h:1957:rb_current_execution_context:ec == rb_current_ec_noinline()
ruby 3.4.0dev (2024-02-07T07:52:06Z ktsanaktsidis/igno.. 5cc6d944c2) [x86_64-linux]

-- Control frame information -----------------------------------------------
c:0003 p:0003 s:0010 e:000009 METHOD <internal:ractor>:431
c:0002 p:0004 s:0006 e:000005 BLOCK  ractor_crash.rb:3 [FINISH]
c:0001 p:---- s:0003 e:000002 DUMMY  [FINISH]

-- Ruby level backtrace information ----------------------------------------
ractor_crash.rb:3:in `block (2 levels) in <main>'
<internal:ractor>:431:in `receive'

-- Threading information ---------------------------------------------------
Total ractor count: 2
Ruby thread count for this ractor: 1

-- C level backtrace information -------------------------------------------
/home/kj/ruby/build/miniruby(rb_print_backtrace+0x14) [0x55faa97a4ebd] ../vm_dump.c:820
/home/kj/ruby/build/miniruby(rb_vm_bugreport) ../vm_dump.c:1151
/home/kj/ruby/build/miniruby(rb_assert_failure+0x81) [0x55faa94d2719] ../error.c:1131
./miniruby(thread_sched_wait_running_turn+0x2e9) [0x55faa9726f59]
/home/kj/ruby/build/miniruby(rb_ractor_sched_sleep+0x10b) [0x55faa972687b] ../thread_pthread.c:1348
/home/kj/ruby/build/miniruby(ractor_check_ints+0x0) [0x55faa968b328] ../ractor.c:683
/home/kj/ruby/build/miniruby(ractor_sleep_with_cleanup) ../ractor.c:684
/home/kj/ruby/build/miniruby(ractor_sleep+0x15) [0x55faa968adf4] ../ractor.c:701
/home/kj/ruby/build/miniruby(ractor_wait_receive) ../ractor.c:748
/home/kj/ruby/build/miniruby(ractor_receive+0x1f) [0x55faa968768e] ../ractor.c:762
/home/kj/ruby/build/miniruby(builtin_inline_class_431) ../ractor.rb:432
/home/kj/ruby/build/miniruby(builtin_invoker0+0x6) [0x55faa978fc66] ../vm_insnhelper.c:6746
/home/kj/ruby/build/miniruby(invoke_bf+0x39) [0x55faa979816e] ../vm_insnhelper.c:6886
/home/kj/ruby/build/miniruby(vm_invoke_builtin_delegate) ../vm_insnhelper.c:6909
/home/kj/ruby/build/miniruby(rb_vm_check_ints+0x0) [0x55faa9771fac] ../insns.def:1533
/home/kj/ruby/build/miniruby(vm_pop_frame) ../vm_insnhelper.c:419
/home/kj/ruby/build/miniruby(vm_exec_core) ../insns.def:1537
/home/kj/ruby/build/miniruby(vm_exec_loop+0x0) [0x55faa9767f02] ../vm.c:2489
/home/kj/ruby/build/miniruby(rb_vm_exec) ../vm.c:2492
/home/kj/ruby/build/miniruby(invoke_block+0x6f) [0x55faa9781a58] ../vm.c:1512
/home/kj/ruby/build/miniruby(invoke_iseq_block_from_c) ../vm.c:1582
/home/kj/ruby/build/miniruby(invoke_block_from_c_proc) ../vm.c:1680
/home/kj/ruby/build/miniruby(vm_invoke_proc) ../vm.c:1710
/home/kj/ruby/build/miniruby(rb_vm_invoke_proc_with_self+0x5a) [0x55faa9781eaa] ../vm.c:1745
/home/kj/ruby/build/miniruby(thread_do_start_proc+0x199) [0x55faa9739e19] ../thread.c:574
/home/kj/ruby/build/miniruby(thread_do_start+0x6c) [0x55faa973933f] ../thread.c:618
/home/kj/ruby/build/miniruby(thread_start_func_2) ../thread.c:668
/home/kj/ruby/build/miniruby(rb_native_mutex_lock+0x0) [0x55faa973a141] ../thread_pthread.c:2234
/home/kj/ruby/build/miniruby(thread_sched_lock_) ../thread_pthread.c:387
/home/kj/ruby/build/miniruby(call_thread_start_func_2) ../thread_pthread_mn.c:436
/home/kj/ruby/build/miniruby(co_start) ../thread_pthread_mn.c:434

The failing assertion is this one in vm_core.h: https://github.com/ruby/ruby/blob/42c36269403baac67b0d5dc1d6d6e31168cf6a1f/vm_core.h#L1957. It actually has a very helpful comment.

    /* On the shared objects, `__tls_get_addr()` is used to access the TLS
     * and the address of the `ruby_current_ec` can be stored on a function
     * frame. However, this address can be mis-used after native thread
     * migration of a coroutine.
     *   1) Get `ptr =&ruby_current_ec` op NT1 and store it on the frame.
     *   2) Context switch and resume it on the NT2.
     *   3) `ptr` is used on NT2 but it accesses to the TLS on NT1.
     * This assertion checks such misusage.
     *
     * To avoid accidents, `GET_EC()` should be called once on the frame.
     * Note that inlining can produce the problem.
     */
    VM_ASSERT(ec == rb_current_ec_noinline());

What seems to be happening is exactly that. This is a disassembly of the relevant bits of thread_sched_wait_running_turn:

........
   # This is the only bits of the entire function which access the TLS base register %fs.
   # It seems to have spilled the value of ruby_current_ec into %r13.
   0x000055603d2e1cf8 <+136>:	mov    $0xffffffffffffff90,%rax
   0x000055603d2e1cff <+143>:	mov    %fs:0x0,%r12
   0x000055603d2e1d08 <+152>:	add    %rax,%r12
   0x000055603d2e1d0b <+155>:	mov    %fs:(%rax),%r13
........
   # There's a call to coroutine_transfer, so after this point we're returned to on a
   # different thread
   0x000055603d2e1e90 <+544>:	call   0x55603d7fce84 <coroutine_transfer>
   # But nothing ever loads the address of ruby_current_ec from %fs again (i didn't trace
   # exactly the data flow from %r13 at 0x000055603d2e1d0b to here, but i assume it spilled
   # somewhere and now got loaded back into %r15 here). In any case, that means %r15 here
   # contains the value of ruby_current_ec from the _old_ thread, not the current one.
   0x000055603d2e1e95 <+549>:	mov    %rbx,0x28(%r14)
   0x000055603d2e1e99 <+553>:	mov    (%r12),%r15
   0x000055603d2e1e9d <+557>:	call   0x55603d33a010 <rb_current_ec_noinline>
   0x000055603d2e1ea2 <+562>:	cmp    %rax,%r15
=> 0x000055603d2e1ea5 <+565>:	jne    0x55603d2e1f3a <thread_sched_wait_running_turn+714>
........
   # assertion failure code path.
   0x000055603d2e1f3a <+714>:	lea    0x542c0c(%rip),%rdi        # 0x55603d824b4d
   0x000055603d2e1f41 <+721>:	lea    0x542c12(%rip),%rdx        # 0x55603d824b5a
   0x000055603d2e1f48 <+728>:	lea    0x542c28(%rip),%rcx        # 0x55603d824b77
   0x000055603d2e1f4f <+735>:	mov    $0x7a5,%esi
   0x000055603d2e1f54 <+740>:	call   0x55603d08d698 <rb_assert_failure>

if we look at the register values from 0x000055603d2e1ea2:

(rr) print/x $rax
$2 = 0x55603e159ad0
(rr) print/x $r15
$3 = 0x0

So the value from %rax which came from ruby_current_ec_noinline is correctly the value of ruby_current_ec for this thread, and %r15 contains a stale value from a previous thread.

Now, what can we do about this, is a different question :/ There's a really good stackoverflow answer about it here: https://stackoverflow.com/questions/75592038/how-to-disable-clang-expression-elimination-for-thread-local-variable, but to summarise

Other than reimplementing all of our coroutine stuff on top of C++ coroutines, I'm not sure what else we can do. AFAICT there's no way to tell the compiler that we clobbered the %fs register because that's just not a thing in its model (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66631, but i assume clang is similar).

Thoughts? For now I think my workaround is to disable M:N at build time when building with ASAN (or turn optimizations down). At least this isn't a problem with Fiber because we never move them across threads (probably for this reason in part).


Related issues 1 (1 open0 closed)

Related to Ruby master - Misc #20387: Meta-ticket for ASAN supportAssignedkjtsanaktsidis (KJ Tsanaktsidis)Actions

Updated by kjtsanaktsidis (KJ Tsanaktsidis) 10 months ago

yeah you definitely can't tell the compiler you're clobbering %fs

../thread_pthread.c:1166:20: error: unknown register name 'fs' in asm
 1166 |     __asm__("" ::: "fs", "memory");
      |                    ^
1 error generated.
make: *** [Makefile:448: thread.o] Error 1
Actions #2

Updated by kjtsanaktsidis (KJ Tsanaktsidis) 8 months ago

  • Related to Misc #20387: Meta-ticket for ASAN support added
Actions

Also available in: Atom PDF

Like0
Like0Like0