Bug #20243
openM:N threading VM_ASSERT failure in rb_current_execution_context with clang 17 (on Linux)
Description
When building with Clang 17 and -DVM_CHECK_MODE=1
(with the following configure)
optflags="-ggdb3 -fno-omit-frame-pointer -fno-optimize-sibling-calls -O3" cflags="-DVM_CHECK_MODE=1" CC=clang ../configure --prefix=/home/kj/ruby/installed --enable-yjit=dev --disable-install-doc
And then running the following script with the built ./miniruby
(which is actually from bootstraptest/test_ractor.rb
):
counts = []
counts << Ractor.count
ractors = (1..3).map { Ractor.new { Ractor.receive } }
counts << Ractor.count
ractors[0].send('End 0').take
sleep 0.1 until ractors[0].inspect =~ /terminated/
counts << Ractor.count
ractors[1].send('End 1').take
sleep 0.1 until ractors[1].inspect =~ /terminated/
counts << Ractor.count
ractors[2].send('End 2').take
sleep 0.1 until ractors[2].inspect =~ /terminated/
counts << Ractor.count
counts.inspect
I get the following crash:
Assertion Failed: ../vm_core.h:1957:rb_current_execution_context:ec == rb_current_ec_noinline()
ruby 3.4.0dev (2024-02-07T07:52:06Z ktsanaktsidis/igno.. 5cc6d944c2) [x86_64-linux]
-- Control frame information -----------------------------------------------
c:0003 p:0003 s:0010 e:000009 METHOD <internal:ractor>:431
c:0002 p:0004 s:0006 e:000005 BLOCK ractor_crash.rb:3 [FINISH]
c:0001 p:---- s:0003 e:000002 DUMMY [FINISH]
-- Ruby level backtrace information ----------------------------------------
ractor_crash.rb:3:in `block (2 levels) in <main>'
<internal:ractor>:431:in `receive'
-- Threading information ---------------------------------------------------
Total ractor count: 2
Ruby thread count for this ractor: 1
-- C level backtrace information -------------------------------------------
/home/kj/ruby/build/miniruby(rb_print_backtrace+0x14) [0x55faa97a4ebd] ../vm_dump.c:820
/home/kj/ruby/build/miniruby(rb_vm_bugreport) ../vm_dump.c:1151
/home/kj/ruby/build/miniruby(rb_assert_failure+0x81) [0x55faa94d2719] ../error.c:1131
./miniruby(thread_sched_wait_running_turn+0x2e9) [0x55faa9726f59]
/home/kj/ruby/build/miniruby(rb_ractor_sched_sleep+0x10b) [0x55faa972687b] ../thread_pthread.c:1348
/home/kj/ruby/build/miniruby(ractor_check_ints+0x0) [0x55faa968b328] ../ractor.c:683
/home/kj/ruby/build/miniruby(ractor_sleep_with_cleanup) ../ractor.c:684
/home/kj/ruby/build/miniruby(ractor_sleep+0x15) [0x55faa968adf4] ../ractor.c:701
/home/kj/ruby/build/miniruby(ractor_wait_receive) ../ractor.c:748
/home/kj/ruby/build/miniruby(ractor_receive+0x1f) [0x55faa968768e] ../ractor.c:762
/home/kj/ruby/build/miniruby(builtin_inline_class_431) ../ractor.rb:432
/home/kj/ruby/build/miniruby(builtin_invoker0+0x6) [0x55faa978fc66] ../vm_insnhelper.c:6746
/home/kj/ruby/build/miniruby(invoke_bf+0x39) [0x55faa979816e] ../vm_insnhelper.c:6886
/home/kj/ruby/build/miniruby(vm_invoke_builtin_delegate) ../vm_insnhelper.c:6909
/home/kj/ruby/build/miniruby(rb_vm_check_ints+0x0) [0x55faa9771fac] ../insns.def:1533
/home/kj/ruby/build/miniruby(vm_pop_frame) ../vm_insnhelper.c:419
/home/kj/ruby/build/miniruby(vm_exec_core) ../insns.def:1537
/home/kj/ruby/build/miniruby(vm_exec_loop+0x0) [0x55faa9767f02] ../vm.c:2489
/home/kj/ruby/build/miniruby(rb_vm_exec) ../vm.c:2492
/home/kj/ruby/build/miniruby(invoke_block+0x6f) [0x55faa9781a58] ../vm.c:1512
/home/kj/ruby/build/miniruby(invoke_iseq_block_from_c) ../vm.c:1582
/home/kj/ruby/build/miniruby(invoke_block_from_c_proc) ../vm.c:1680
/home/kj/ruby/build/miniruby(vm_invoke_proc) ../vm.c:1710
/home/kj/ruby/build/miniruby(rb_vm_invoke_proc_with_self+0x5a) [0x55faa9781eaa] ../vm.c:1745
/home/kj/ruby/build/miniruby(thread_do_start_proc+0x199) [0x55faa9739e19] ../thread.c:574
/home/kj/ruby/build/miniruby(thread_do_start+0x6c) [0x55faa973933f] ../thread.c:618
/home/kj/ruby/build/miniruby(thread_start_func_2) ../thread.c:668
/home/kj/ruby/build/miniruby(rb_native_mutex_lock+0x0) [0x55faa973a141] ../thread_pthread.c:2234
/home/kj/ruby/build/miniruby(thread_sched_lock_) ../thread_pthread.c:387
/home/kj/ruby/build/miniruby(call_thread_start_func_2) ../thread_pthread_mn.c:436
/home/kj/ruby/build/miniruby(co_start) ../thread_pthread_mn.c:434
The failing assertion is this one in vm_core.h: https://github.com/ruby/ruby/blob/42c36269403baac67b0d5dc1d6d6e31168cf6a1f/vm_core.h#L1957. It actually has a very helpful comment.
/* On the shared objects, `__tls_get_addr()` is used to access the TLS
* and the address of the `ruby_current_ec` can be stored on a function
* frame. However, this address can be mis-used after native thread
* migration of a coroutine.
* 1) Get `ptr =&ruby_current_ec` op NT1 and store it on the frame.
* 2) Context switch and resume it on the NT2.
* 3) `ptr` is used on NT2 but it accesses to the TLS on NT1.
* This assertion checks such misusage.
*
* To avoid accidents, `GET_EC()` should be called once on the frame.
* Note that inlining can produce the problem.
*/
VM_ASSERT(ec == rb_current_ec_noinline());
What seems to be happening is exactly that. This is a disassembly of the relevant bits of thread_sched_wait_running_turn
:
........
# This is the only bits of the entire function which access the TLS base register %fs.
# It seems to have spilled the value of ruby_current_ec into %r13.
0x000055603d2e1cf8 <+136>: mov $0xffffffffffffff90,%rax
0x000055603d2e1cff <+143>: mov %fs:0x0,%r12
0x000055603d2e1d08 <+152>: add %rax,%r12
0x000055603d2e1d0b <+155>: mov %fs:(%rax),%r13
........
# There's a call to coroutine_transfer, so after this point we're returned to on a
# different thread
0x000055603d2e1e90 <+544>: call 0x55603d7fce84 <coroutine_transfer>
# But nothing ever loads the address of ruby_current_ec from %fs again (i didn't trace
# exactly the data flow from %r13 at 0x000055603d2e1d0b to here, but i assume it spilled
# somewhere and now got loaded back into %r15 here). In any case, that means %r15 here
# contains the value of ruby_current_ec from the _old_ thread, not the current one.
0x000055603d2e1e95 <+549>: mov %rbx,0x28(%r14)
0x000055603d2e1e99 <+553>: mov (%r12),%r15
0x000055603d2e1e9d <+557>: call 0x55603d33a010 <rb_current_ec_noinline>
0x000055603d2e1ea2 <+562>: cmp %rax,%r15
=> 0x000055603d2e1ea5 <+565>: jne 0x55603d2e1f3a <thread_sched_wait_running_turn+714>
........
# assertion failure code path.
0x000055603d2e1f3a <+714>: lea 0x542c0c(%rip),%rdi # 0x55603d824b4d
0x000055603d2e1f41 <+721>: lea 0x542c12(%rip),%rdx # 0x55603d824b5a
0x000055603d2e1f48 <+728>: lea 0x542c28(%rip),%rcx # 0x55603d824b77
0x000055603d2e1f4f <+735>: mov $0x7a5,%esi
0x000055603d2e1f54 <+740>: call 0x55603d08d698 <rb_assert_failure>
if we look at the register values from 0x000055603d2e1ea2
:
(rr) print/x $rax
$2 = 0x55603e159ad0
(rr) print/x $r15
$3 = 0x0
So the value from %rax
which came from ruby_current_ec_noinline
is correctly the value of ruby_current_ec
for this thread, and %r15
contains a stale value from a previous thread.
Now, what can we do about this, is a different question :/ There's a really good stackoverflow answer about it here: https://stackoverflow.com/questions/75592038/how-to-disable-clang-expression-elimination-for-thread-local-variable, but to summarise
- longstanding GCC and Clang bugs for this exist and have been marked as WONTFIX (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26461, https://github.com/llvm/llvm-project/issues/19551)
- It's even worse than this EC problem - things like
errno
also might be incorrectly persisted across coroutine switches (so e.g. an inlined C library function could in theory seterrno
in another thread, for example) - C++ actually has coroutines now, so this must work for those. Clang at least has fixed some TLS problems in their C++ coroutine implementation (https://github.com/llvm/llvm-project/issues/47179)
Other than reimplementing all of our coroutine stuff on top of C++ coroutines, I'm not sure what else we can do. AFAICT there's no way to tell the compiler that we clobbered the %fs
register because that's just not a thing in its model (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66631, but i assume clang is similar).
Thoughts? For now I think my workaround is to disable M:N at build time when building with ASAN (or turn optimizations down). At least this isn't a problem with Fiber
because we never move them across threads (probably for this reason in part).
Updated by kjtsanaktsidis (KJ Tsanaktsidis) 10 months ago
yeah you definitely can't tell the compiler you're clobbering %fs
../thread_pthread.c:1166:20: error: unknown register name 'fs' in asm
1166 | __asm__("" ::: "fs", "memory");
| ^
1 error generated.
make: *** [Makefile:448: thread.o] Error 1
Updated by kjtsanaktsidis (KJ Tsanaktsidis) 8 months ago
- Related to Misc #20387: Meta-ticket for ASAN support added