Backport #7693

Thread deadlock (gvl.wait_yield is not (re-)initialized)

Added by Apollon Oikonomopoulos over 1 year ago. Updated over 1 year ago.

[ruby-core:51424]
Status:Closed
Priority:Normal
Assignee:Usaku NAKAMURA

Description

We are running ruby 1.9.3p194 on Debian and are experiencing a deadlock in the
following scenario:

We have a bunch of resque[1] workers equiped with the New Relic agent[2]. The
New Relic agent runs on a separate thread and simply collects process metrics
from /proc/self and submits them over the network.

The workers fork() a child for every job they process. Sporadically, one or two
of the forked children will hang, all of them in the following state:

(gdb) info threads
3 Thread 32111 pthreadcondwait@@GLIBC2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x8664/pthreadcondwait.S:162
2 Thread 32594 0x00007feac7c233e3 in select () at ../sysdeps/unix/syscall-template.S:82
* 1 Thread 32126 pthreadcondtimedwait@@GLIBC2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x8664/pthreadcondtimedwait.S:211

Thread 1 is the New Relic agent thread, which runs as expected (i.e. wakes up
every minute and does work). The weird thing is the state of thread 3 (main
thread) in this case:

(gdb) thread apply 3 bt

Thread 3 (Thread 32111):
#0 pthreadcondwait@@GLIBC2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x8664/pthreadcondwait.S:162
#1 0x00007feac8af11a9 in nativecondwait (cond=0x108cfd4, mutex=0x80) at threadpthread.c:305
#2 0x00007feac8af47f9 in gvl
yield (limitsus=) at threadpthread.c:109
#3 rbthreadschedulelimits (limitsus=) at thread.c:1025
#4 0x00007feac8af48d0 in rbthreadptrexecuteinterruptscommon (th=0x108d520) at thread.c:1327
#5 0x00007feac8ae63ed in vmexeccore (th=0x108d520, initial=128) at insns.def:1184
#6 0x00007feac8ae7029 in vmexec (th=) at vm.c:1220
#7 0x00007feac8ae8f03 in invoke
blockfromc (th=, proc=, self=, argc=, argv=, blockptr=) at vm.c:624
#8 rbvminvokeproc (th=, proc=, self=, argc=, argv=, blockptr=) at vm.c:670
#9 0x00007feac89fdd1a in proc
call (argc=1, argv=0x27e3d70, procval=) at proc.c:562
#10 0x00007feac8ae9433 in vmcall0 (th=0x108d520, recv=, id=, argc=1, argv=0x27e3d70, me=0x1191f60) at vmeval.c:79
#11 0x00007feac8aea2d3 in rbevalcmd (cmd=, arg=, level=) at vmeval.c:1196
#12 0x00007feac8af48e2 in rb
threadptrexecuteinterruptscommon (th=0x108d520) at thread.c:1290
#13 0x00007feac8af4c1c in rb
threadptrexecuteinterrupts (th=0x108d520, func=0x7feac8af5980 , arg=0x108d520, old=0x7fffb5541f28) at thread.c:1335
#14 setunblockfunction (th=0x108d520, func=0x7feac8af5980 , arg=0x108d520, old=0x7fffb5541f28) at thread.c:269
#15 0x00007feac8af7e84 in rbthreadblockingregion (func=0x7feac8a5d070 <rbwaitpidblocking>, data1=0x7fffb5541f80, ubf=0x80, data2=0x1) at thread.c:1129
#16 0x00007feac8a5d174 in rb
waitpid (pid=32593, st=0x7fffb5541fdc, flags=0) at process.c:655
#17 0x00007feac8a5d3c4 in rbsyswait (pid=32593) at process.c:3018
#18 0x00007feac8a1b365 in rb
ioclose (io=) at io.c:3838
#19 0x00007feac8a1f0b1 in rb
f_backquote (obj=, str=41828360) at io.c:7519
...

Frame #2 reveals that it is stuck waiting in gvlyield, in the following
native
condwait (threadpthread.c:109):

/* An another thread is processing GVL yield. */
if (UNLIKELY(vm->gvl.wait_yield)) {
    while (vm->gvl.wait_yield)
        native_cond_wait(&vm->gvl.switch_wait_cond, &vm->gvl.lock);
    goto acquire;
}

Now, by reading the code of gvlyield(), this would normally only be possible if
we had another thread in native
cond_wait(), which is not the case. For
completeness, here are the backtraces of the other threads:

Thread 1 (New Relic agent)
#0 pthreadcondtimedwait@@GLIBC2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x8664/pthreadcondtimedwait.S:211
#1 0x00007feac8af2f8e in nativecondtimedwait (cond=0xca1a580, mutex=0xca1a5e8, ts=)
at threadpthread.c:323
#2 0x00007feac8af45e0 in native
sleep (th=0xca1a4f0, timeouttv=0x7feabdfe4900) at threadpthread.c:908
#3 0x00007feac8af4a82 in sleeptimeval (th=0xca1a4f0, tv=...) at thread.c:895
#4 0x00007feac8a5a842 in rb
fsleep (argc=1, argv=0x7feabde64108) at process.c:3472
#5 0x00007feac8adfc3b in vm
callcfunc (th=0xca1a4f0, cfp=0x7feabdf63ca0, num=1, blockptr=,
flag=, id=140646186633392, me=0x1185560, recv=72328920) at vm
insnhelper.c:404
#6 vmcallmethod (th=0xca1a4f0, cfp=0x7feabdf63ca0, num=1, blockptr=,
flag=, id=140646186633392, me=0x1185560, recv=72328920) at vminsnhelper.c:534
#7 0x00007feac8ae2462 in vm
execcore (th=0xca1a4f0, initial=) at insns.def:1015
#8 0x00007feac8ae7029 in vm
exec (th=) at vm.c:1220
#9 0x00007feac8ae8f03 in invokeblockfromc (th=, proc=,
self=, argc=, argv=,
blockptr=) at vm.c:624
#10 rb
vminvokeproc (th=, proc=, self=,
argc=, argv=, blockptr=) at vm.c:670
#11 0x00007feac8af6898 in threadstartfunc2 (th=, stackstart=0x7feabdfe6000) at thread.c:453
#12 0x00007feac8af68f0 in threadstartfunc1 (thptr=0xca1a4f0) at threadpthread.c:651
#13 0x00007feac87878ca in start
thread (arg=) at pthreadcreate.c:300
#14 0x00007feac7c29b6d in clone () at ../sysdeps/unix/sysv/linux/x86
64/clone.S:112
#15 0x0000000000000000 in ?? ()

Thread 2 (Timer):
#0 0x00007feac7c233e3 in select () at ../sysdeps/unix/syscall-template.S:82
#1 0x00007feac8af3c92 in threadtimer (p=) at threadpthread.c:1156
#2 0x00007feac87878ca in startthread (arg=) at pthreadcreate.c:300
#3 0x00007feac7c29b6d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#4 0x0000000000000000 in ?? ()

We also see the following GVL state:

(gdb) frame
#3 rbthreadschedulelimits (limitsus=) at thread.c:1025
(gdb) p th->vm->gvl.waityield
$1 = 1
(gdb) p th->vm->gvl.need
yield
$2 = 0
(gdb) p th->vm->gvl.waiting
$3 = 0

Searching for gvl.waityield, we find out that it is never initialized in
gvl
init(), along with the other GVL flags. Since it is only fork()ed children
from multi-threaded parents that sporadically deadlock, and since gvlinit() is
also called at fork time, we suspect that the child ends up with an
inconsistent GVL state when a fork() happens while a thread is waiting in
gvl
yield() (I'm not sure if this is possible though). The child then inherits
a gvl.waityield = 1 which is not reset and leads to a deadlock the first time
a gvl
yield() is processed in the new process.

Setting waityield to 0 in gvlinit() (see attached patch) seems to fix the issue
for us, I'm not sure it's the proper fix though.

Regards,
Apollon

P.S.: The same issue should affect trunk as well, judging from the code.

[1] https://github.com/defunkt/resque
[2] https://github.com/newrelic/rpm

init-gvl.wait_yield.patch Magnifier - Patch initializing gvl.wait_yield (296 Bytes) Apollon Oikonomopoulos, 01/14/2013 06:05 PM

Associated revisions

Revision 38832
Added by Usaku NAKAMURA over 1 year ago

merge revision(s) 38819: [Backport #7693]

* thread_pthread.c (gvl_init): Reset gvl.wait_yield explicitly when
  fork()ing. Patch by Apollon Oikonomopoulos. Thanks!
  [Bug #7693]

History

#1 Updated by Motohiro KOSAKI over 1 year ago

  • Status changed from Open to Assigned
  • Assignee set to Motohiro KOSAKI

#2 Updated by Motohiro KOSAKI over 1 year ago

  • Status changed from Assigned to Closed
  • % Done changed from 0 to 100

This issue was solved with changeset r38819.
Apollon, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


  • threadpthread.c (gvlinit): Reset gvl.wait_yield explicitly when fork()ing. Patch by Apollon Oikonomopoulos. Thanks! [Bug #7693]

#3 Updated by Motohiro KOSAKI over 1 year ago

  • Tracker changed from Bug to Backport
  • Project changed from ruby-trunk to Backport93
  • Category deleted (core)
  • Status changed from Closed to Assigned
  • Assignee changed from Motohiro KOSAKI to Usaku NAKAMURA

#4 Updated by Usaku NAKAMURA over 1 year ago

  • Status changed from Assigned to Closed

This issue was solved with changeset r38832.
Apollon, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


merge revision(s) 38819: [Backport #7693]

* thread_pthread.c (gvl_init): Reset gvl.wait_yield explicitly when
  fork()ing. Patch by Apollon Oikonomopoulos. Thanks!
  [Bug #7693]

Also available in: Atom PDF