Backport #7693

Thread deadlock (gvl.wait_yield is not (re-)initialized)

Added by Apollon Oikonomopoulos about 2 years ago. Updated about 2 years ago.

[ruby-core:51424]
Status:Closed
Priority:Normal
Assignee:Usaku NAKAMURA

Description

We are running ruby 1.9.3p194 on Debian and are experiencing a deadlock in the
following scenario:

We have a bunch of resque[1] workers equiped with the New Relic agent[2]. The
New Relic agent runs on a separate thread and simply collects process metrics
from /proc/self and submits them over the network.

The workers fork() a child for every job they process. Sporadically, one or two
of the forked children will hang, all of them in the following state:

(gdb) info threads
3 Thread 32111 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
2 Thread 32594 0x00007feac7c233e3 in select () at ../sysdeps/unix/syscall-template.S:82
* 1 Thread 32126 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:211

Thread 1 is the New Relic agent thread, which runs as expected (i.e. wakes up
every minute and does work). The weird thing is the state of thread 3 (main
thread) in this case:

(gdb) thread apply 3 bt

Thread 3 (Thread 32111):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x00007feac8af11a9 in native_cond_wait (cond=0x108cfd4, mutex=0x80) at thread_pthread.c:305
#2 0x00007feac8af47f9 in gvl_yield (limits_us=) at thread_pthread.c:109
#3 rb_thread_schedule_limits (limits_us=) at thread.c:1025
#4 0x00007feac8af48d0 in rb_threadptr_execute_interrupts_common (th=0x108d520) at thread.c:1327
#5 0x00007feac8ae63ed in vm_exec_core (th=0x108d520, initial=128) at insns.def:1184
#6 0x00007feac8ae7029 in vm_exec (th=) at vm.c:1220
#7 0x00007feac8ae8f03 in invoke_block_from_c (th=, proc=, self=, argc=, argv=, blockptr=) at vm.c:624
#8 rb_vm_invoke_proc (th=, proc=, self=, argc=, argv=, blockptr=) at vm.c:670
#9 0x00007feac89fdd1a in proc_call (argc=1, argv=0x27e3d70, procval=) at proc.c:562
#10 0x00007feac8ae9433 in vm_call0 (th=0x108d520, recv=, id=, argc=1, argv=0x27e3d70, me=0x1191f60) at vm_eval.c:79
#11 0x00007feac8aea2d3 in rb_eval_cmd (cmd=, arg=, level=) at vm_eval.c:1196
#12 0x00007feac8af48e2 in rb_threadptr_execute_interrupts_common (th=0x108d520) at thread.c:1290
#13 0x00007feac8af4c1c in rb_threadptr_execute_interrupts (th=0x108d520, func=0x7feac8af5980 , arg=0x108d520, old=0x7fffb5541f28) at thread.c:1335
#14 set_unblock_function (th=0x108d520, func=0x7feac8af5980 , arg=0x108d520, old=0x7fffb5541f28) at thread.c:269
#15 0x00007feac8af7e84 in rb_thread_blocking_region (func=0x7feac8a5d070 , data1=0x7fffb5541f80, ubf=0x80, data2=0x1) at thread.c:1129
#16 0x00007feac8a5d174 in rb_waitpid (pid=32593, st=0x7fffb5541fdc, flags=0) at process.c:655
#17 0x00007feac8a5d3c4 in rb_syswait (pid=32593) at process.c:3018
#18 0x00007feac8a1b365 in rb_io_close (io=) at io.c:3838
#19 0x00007feac8a1f0b1 in rb_f_backquote (obj=, str=41828360) at io.c:7519
...

Frame #2 reveals that it is stuck waiting in gvl_yield, in the following
native_cond_wait (thread_pthread.c:109):

/* An another thread is processing GVL yield. */
if (UNLIKELY(vm->gvl.wait_yield)) {
    while (vm->gvl.wait_yield)
        native_cond_wait(&vm->gvl.switch_wait_cond, &vm->gvl.lock);
    goto acquire;
}

Now, by reading the code of gvl_yield(), this would normally only be possible if
we had another thread in native_cond_wait(), which is not the case. For
completeness, here are the backtraces of the other threads:

Thread 1 (New Relic agent)
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:211
#1 0x00007feac8af2f8e in native_cond_timedwait (cond=0xca1a580, mutex=0xca1a5e8, ts=)
at thread_pthread.c:323
#2 0x00007feac8af45e0 in native_sleep (th=0xca1a4f0, timeout_tv=0x7feabdfe4900) at thread_pthread.c:908
#3 0x00007feac8af4a82 in sleep_timeval (th=0xca1a4f0, tv=...) at thread.c:895
#4 0x00007feac8a5a842 in rb_f_sleep (argc=1, argv=0x7feabde64108) at process.c:3472
#5 0x00007feac8adfc3b in vm_call_cfunc (th=0xca1a4f0, cfp=0x7feabdf63ca0, num=1, blockptr=,
flag=, id=140646186633392, me=0x1185560, recv=72328920) at vm_insnhelper.c:404
#6 vm_call_method (th=0xca1a4f0, cfp=0x7feabdf63ca0, num=1, blockptr=,
flag=, id=140646186633392, me=0x1185560, recv=72328920) at vm_insnhelper.c:534
#7 0x00007feac8ae2462 in vm_exec_core (th=0xca1a4f0, initial=) at insns.def:1015
#8 0x00007feac8ae7029 in vm_exec (th=) at vm.c:1220
#9 0x00007feac8ae8f03 in invoke_block_from_c (th=, proc=,
self=, argc=, argv=,
blockptr=) at vm.c:624
#10 rb_vm_invoke_proc (th=, proc=, self=,
argc=, argv=, blockptr=) at vm.c:670
#11 0x00007feac8af6898 in thread_start_func_2 (th=, stack_start=0x7feabdfe6000) at thread.c:453
#12 0x00007feac8af68f0 in thread_start_func_1 (th_ptr=0xca1a4f0) at thread_pthread.c:651
#13 0x00007feac87878ca in start_thread (arg=) at pthread_create.c:300
#14 0x00007feac7c29b6d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#15 0x0000000000000000 in ?? ()

Thread 2 (Timer):
#0 0x00007feac7c233e3 in select () at ../sysdeps/unix/syscall-template.S:82
#1 0x00007feac8af3c92 in thread_timer (p=) at thread_pthread.c:1156
#2 0x00007feac87878ca in start_thread (arg=) at pthread_create.c:300
#3 0x00007feac7c29b6d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#4 0x0000000000000000 in ?? ()

We also see the following GVL state:

(gdb) frame
#3 rb_thread_schedule_limits (limits_us=) at thread.c:1025
(gdb) p th->vm->gvl.wait_yield
$1 = 1
(gdb) p th->vm->gvl.need_yield
$2 = 0
(gdb) p th->vm->gvl.waiting
$3 = 0

Searching for gvl.wait_yield, we find out that it is never initialized in
gvl_init(), along with the other GVL flags. Since it is only fork()ed children
from multi-threaded parents that sporadically deadlock, and since gvl_init() is
also called at fork time, we suspect that the child ends up with an
inconsistent GVL state when a fork() happens while a thread is waiting in
gvl_yield() (I'm not sure if this is possible though). The child then inherits
a gvl.wait_yield = 1 which is not reset and leads to a deadlock the first time
a gvl_yield() is processed in the new process.

Setting wait_yield to 0 in gvl_init() (see attached patch) seems to fix the issue
for us, I'm not sure it's the proper fix though.

Regards,
Apollon

P.S.: The same issue should affect trunk as well, judging from the code.

[1] https://github.com/defunkt/resque
[2] https://github.com/newrelic/rpm

init-gvl.wait_yield.patch Magnifier - Patch initializing gvl.wait_yield (296 Bytes) Apollon Oikonomopoulos, 01/14/2013 06:05 PM

Associated revisions

Revision 38819
Added by Motohiro KOSAKI about 2 years ago

  • thread_pthread.c (gvl_init): Reset gvl.wait_yield explicitly when fork()ing. Patch by Apollon Oikonomopoulos. Thanks! [Bug #7693]

Revision 38832
Added by Usaku NAKAMURA about 2 years ago

merge revision(s) 38819: [Backport #7693]

* thread_pthread.c (gvl_init): Reset gvl.wait_yield explicitly when
  fork()ing. Patch by Apollon Oikonomopoulos. Thanks!
  [Bug #7693]

History

#1 Updated by Motohiro KOSAKI about 2 years ago

  • Status changed from Open to Assigned
  • Assignee set to Motohiro KOSAKI

#2 Updated by Motohiro KOSAKI about 2 years ago

  • Status changed from Assigned to Closed
  • % Done changed from 0 to 100

This issue was solved with changeset r38819.
Apollon, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


  • thread_pthread.c (gvl_init): Reset gvl.wait_yield explicitly when fork()ing. Patch by Apollon Oikonomopoulos. Thanks! [Bug #7693]

#3 Updated by Motohiro KOSAKI about 2 years ago

  • Tracker changed from Bug to Backport
  • Project changed from Ruby trunk to Backport193
  • Category deleted (core)
  • Status changed from Closed to Assigned
  • Assignee changed from Motohiro KOSAKI to Usaku NAKAMURA

#4 Updated by Usaku NAKAMURA about 2 years ago

  • Status changed from Assigned to Closed

This issue was solved with changeset r38832.
Apollon, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


merge revision(s) 38819: [Backport #7693]

* thread_pthread.c (gvl_init): Reset gvl.wait_yield explicitly when
  fork()ing. Patch by Apollon Oikonomopoulos. Thanks!
  [Bug #7693]

Also available in: Atom PDF