Project

General

Profile

Actions

Backport #7693

closed

Thread deadlock (gvl.wait_yield is not (re-)initialized)

Added by apoikos (Apollon Oikonomopoulos) over 11 years ago. Updated over 11 years ago.

Status:
Closed
[ruby-core:51424]

Description

We are running ruby 1.9.3p194 on Debian and are experiencing a deadlock in the
following scenario:

We have a bunch of resque[1] workers equiped with the New Relic agent[2]. The
New Relic agent runs on a separate thread and simply collects process metrics
from /proc/self and submits them over the network.

The workers fork() a child for every job they process. Sporadically, one or two
of the forked children will hang, all of them in the following state:

(gdb) info threads
3 Thread 32111 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
2 Thread 32594 0x00007feac7c233e3 in select () at ../sysdeps/unix/syscall-template.S:82

  • 1 Thread 32126 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:211

Thread 1 is the New Relic agent thread, which runs as expected (i.e. wakes up
every minute and does work). The weird thing is the state of thread 3 (main
thread) in this case:

(gdb) thread apply 3 bt

Thread 3 (Thread 32111):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x00007feac8af11a9 in native_cond_wait (cond=0x108cfd4, mutex=0x80) at thread_pthread.c:305
#2 0x00007feac8af47f9 in gvl_yield (limits_us=) at thread_pthread.c:109
#3 rb_thread_schedule_limits (limits_us=) at thread.c:1025
#4 0x00007feac8af48d0 in rb_threadptr_execute_interrupts_common (th=0x108d520) at thread.c:1327
#5 0x00007feac8ae63ed in vm_exec_core (th=0x108d520, initial=128) at insns.def:1184
#6 0x00007feac8ae7029 in vm_exec (th=) at vm.c:1220
#7 0x00007feac8ae8f03 in invoke_block_from_c (th=, proc=, self=, argc=, argv=, blockptr=) at vm.c:624
#8 rb_vm_invoke_proc (th=, proc=, self=, argc=, argv=, blockptr=) at vm.c:670
#9 0x00007feac89fdd1a in proc_call (argc=1, argv=0x27e3d70, procval=) at proc.c:562
#10 0x00007feac8ae9433 in vm_call0 (th=0x108d520, recv=, id=, argc=1, argv=0x27e3d70, me=0x1191f60) at vm_eval.c:79
#11 0x00007feac8aea2d3 in rb_eval_cmd (cmd=, arg=, level=) at vm_eval.c:1196
#12 0x00007feac8af48e2 in rb_threadptr_execute_interrupts_common (th=0x108d520) at thread.c:1290
#13 0x00007feac8af4c1c in rb_threadptr_execute_interrupts (th=0x108d520, func=0x7feac8af5980 <ubf_select>, arg=0x108d520, old=0x7fffb5541f28) at thread.c:1335
#14 set_unblock_function (th=0x108d520, func=0x7feac8af5980 <ubf_select>, arg=0x108d520, old=0x7fffb5541f28) at thread.c:269
#15 0x00007feac8af7e84 in rb_thread_blocking_region (func=0x7feac8a5d070 <rb_waitpid_blocking>, data1=0x7fffb5541f80, ubf=0x80, data2=0x1) at thread.c:1129
#16 0x00007feac8a5d174 in rb_waitpid (pid=32593, st=0x7fffb5541fdc, flags=0) at process.c:655
#17 0x00007feac8a5d3c4 in rb_syswait (pid=32593) at process.c:3018
#18 0x00007feac8a1b365 in rb_io_close (io=) at io.c:3838
#19 0x00007feac8a1f0b1 in rb_f_backquote (obj=, str=41828360) at io.c:7519
...

Frame #2 reveals that it is stuck waiting in gvl_yield, in the following
native_cond_wait (thread_pthread.c:109):

/* An another thread is processing GVL yield. */
if (UNLIKELY(vm->gvl.wait_yield)) {
    while (vm->gvl.wait_yield)
        native_cond_wait(&vm->gvl.switch_wait_cond, &vm->gvl.lock);
    goto acquire;
}

Now, by reading the code of gvl_yield(), this would normally only be possible if
we had another thread in native_cond_wait(), which is not the case. For
completeness, here are the backtraces of the other threads:

Thread 1 (New Relic agent)
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:211
#1 0x00007feac8af2f8e in native_cond_timedwait (cond=0xca1a580, mutex=0xca1a5e8, ts=)
at thread_pthread.c:323
#2 0x00007feac8af45e0 in native_sleep (th=0xca1a4f0, timeout_tv=0x7feabdfe4900) at thread_pthread.c:908
#3 0x00007feac8af4a82 in sleep_timeval (th=0xca1a4f0, tv=...) at thread.c:895
#4 0x00007feac8a5a842 in rb_f_sleep (argc=1, argv=0x7feabde64108) at process.c:3472
#5 0x00007feac8adfc3b in vm_call_cfunc (th=0xca1a4f0, cfp=0x7feabdf63ca0, num=1, blockptr=,
flag=, id=140646186633392, me=0x1185560, recv=72328920) at vm_insnhelper.c:404
#6 vm_call_method (th=0xca1a4f0, cfp=0x7feabdf63ca0, num=1, blockptr=,
flag=, id=140646186633392, me=0x1185560, recv=72328920) at vm_insnhelper.c:534
#7 0x00007feac8ae2462 in vm_exec_core (th=0xca1a4f0, initial=) at insns.def:1015
#8 0x00007feac8ae7029 in vm_exec (th=) at vm.c:1220
#9 0x00007feac8ae8f03 in invoke_block_from_c (th=, proc=,
self=, argc=, argv=,
blockptr=) at vm.c:624
#10 rb_vm_invoke_proc (th=, proc=, self=,
argc=, argv=, blockptr=) at vm.c:670
#11 0x00007feac8af6898 in thread_start_func_2 (th=, stack_start=0x7feabdfe6000) at thread.c:453
#12 0x00007feac8af68f0 in thread_start_func_1 (th_ptr=0xca1a4f0) at thread_pthread.c:651
#13 0x00007feac87878ca in start_thread (arg=) at pthread_create.c:300
#14 0x00007feac7c29b6d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#15 0x0000000000000000 in ?? ()

Thread 2 (Timer):
#0 0x00007feac7c233e3 in select () at ../sysdeps/unix/syscall-template.S:82
#1 0x00007feac8af3c92 in thread_timer (p=) at thread_pthread.c:1156
#2 0x00007feac87878ca in start_thread (arg=) at pthread_create.c:300
#3 0x00007feac7c29b6d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#4 0x0000000000000000 in ?? ()

We also see the following GVL state:

(gdb) frame
#3 rb_thread_schedule_limits (limits_us=) at thread.c:1025
(gdb) p th->vm->gvl.wait_yield
$1 = 1
(gdb) p th->vm->gvl.need_yield
$2 = 0
(gdb) p th->vm->gvl.waiting
$3 = 0

Searching for gvl.wait_yield, we find out that it is never initialized in
gvl_init(), along with the other GVL flags. Since it is only fork()ed children
from multi-threaded parents that sporadically deadlock, and since gvl_init() is
also called at fork time, we suspect that the child ends up with an
inconsistent GVL state when a fork() happens while a thread is waiting in
gvl_yield() (I'm not sure if this is possible though). The child then inherits
a gvl.wait_yield = 1 which is not reset and leads to a deadlock the first time
a gvl_yield() is processed in the new process.

Setting wait_yield to 0 in gvl_init() (see attached patch) seems to fix the issue
for us, I'm not sure it's the proper fix though.

Regards,
Apollon

P.S.: The same issue should affect trunk as well, judging from the code.

[1] https://github.com/defunkt/resque
[2] https://github.com/newrelic/rpm


Files

init-gvl.wait_yield.patch (296 Bytes) init-gvl.wait_yield.patch Patch initializing gvl.wait_yield apoikos (Apollon Oikonomopoulos), 01/14/2013 06:05 PM
Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0