Ruby Process Deadlocks With Fork on Mac OS X Lion
|ruby -v:||ruby 1.9.3p0 (2011-10-30 revision 33570) [x86_64-darwin11.2.0]||Backport:|
Given a Ruby process that acts like the following:
- Spawn new thread that initializes a TCPSocket
- Execute script using backticks in main thread
there is a chance that it will deadlock on Lion. The GDB traces for the threads show:
- The TCP connecting thread stuck on nativecondwait/threadpthread.c:321 by way of rsockgetaddrinfo/raddrinfo.c:359
- The main thread stuck on read() by way of rbfbackquote/io.c:7266
Meanwhile, in the forked process from rbfbackquote:
- The main thread is stuck at (longer trace): #0 0x00007fff9160c6b6 in semaphorewaittrap () #1 0x00007fff8fc03bc2 in dispatchthreadsemaphorewait () #2 0x00007fff8fc04286 in dispatchoncef () #3 0x00007fff95e12f20 in simodulestaticsearch () #4 0x00007fff95e16a3d in simodulewithname () #5 0x00007fff95e0eac8 in getpwuid () #6 0x00007fff90daa842 in getgroups$DARWINEXTSN () #7 0x000000010b82b020 in rbgroup_member (gid=0) at file.c:1002 #8 0x000000010b82b10f in eaccess (path=0x7fff6b3d3570 "/bin/hostname", mode=1) at file.c:1052 ...
The documentation for getpwuid in Mac OS X Lion states that getpwuid now is threadsafe, much like getpwuidr - however, the values returned by getpwuid are thread local and disposed automatically, as opposed to getpwuidr's allocation of results. The disassembly of semaphorewaittrap and _psynchcvwait both show syscalls being made (I don't know how to go much further here), but the arguments are all void to these functions too when snooping in GDB. I believe that the posix wait and semaphore_wait taking place are in fact making syscalls to wait on a condition variable of the same value - this value is the same due to the shared memory state of the fork.
When an artificial delay ("sleep 1") is introduced after the creation of the TCP connect thread, this deadlock no longer occurs.
Attached is a test script that uses the Instrumental Agent gem for the TCP connect and can reliably cause the deadlock under 1.9.3.
#5 Updated by Sam Goldstein about 1 year ago
- File socket_backtick_test.rb added
We've encountered the same issue on Mac OS X (Lion and Mountain Lion). It's been causing one of our test suites to hang regularly and the process can only be stopped with a kill -9.
We've created a reproduction case which reproduces the issue every time on my machine, running OS 10.7.3, with ruby 1.9.3p286 (2012-10-12 revision 37165) [x86_64-darwin11.3.0]
The bug seems to be triggered by the timing of calls to eaccess so it may be necessary to mess with the repro case's sleep duration or to sleep(rand) on other machines.
When attaching to the hung process in GDB we get the following backtrace:
0x00007fff96c6c6b6 in semaphorewaittrap ()
#0 0x00007fff96c6c6b6 in semaphorewaittrap ()
#1 0x00007fff9246fbc2 in dispatchthreadsemaphorewait ()
#2 0x00007fff92470286 in dispatchoncef ()
#3 0x00007fff99048cb0 in simodulestaticsearch ()
#4 0x00007fff9904ca05 in simodulewithname ()
#5 0x00007fff99044858 in getpwuid ()
#6 0x00007fff95889842 in getgroups$DARWINEXTSN ()
#7 0x000000010d7d6895 in rbgroupmember ()
#8 0x000000010d7d6c67 in eaccess ()
#9 0x000000010d7bffc1 in dlnfind1 ()
#10 0x000000010d7c01bd in dlnfindexer ()
#11 0x000000010d85073d in procexecv ()
#12 0x000000010d8539e2 in rbprocexec ()
#13 0x000000010d853a9b in rbexecerr ()
#14 0x000000010d84cdb5 in chfuncprotect ()
#15 0x000000010d7cd858 in rbprotect ()
#16 0x000000010d852d2b in rbforkerr ()
#17 0x000000010d7f3c0e in pipeopen ()
#18 0x000000010d7f8e53 in rbfbackquote ()
#19 0x000000010d90e967 in vmcallmethod ()
#20 0x000000010d8fceb8 in vmexeccore ()
#21 0x000000010d901453 in vmexec ()
#22 0x000000010d911201 in loopi ()
#23 0x000000010d7cdaf7 in rbrescue2 ()
#24 0x000000010d8f4cd6 in rbfloop ()
#25 0x000000010d90e967 in vmcallmethod ()
#26 0x000000010d8fceb8 in vmexeccore ()
#27 0x000000010d901453 in vmexec ()
#28 0x000000010d901828 in rbiseqevalmain ()
#29 0x000000010d7cdd62 in rubyexecinternal ()
#30 0x000000010d7d085c in rubyrunnode ()
#31 0x000000010d78e73f in main ()
I've attached the reproduction case to this comment. Let us know if there's anything we can do to help reproduce/resolve this.
#7 Updated by Akira Tanaka about 1 year ago
I guess this problem is caused by Ruby 1.9.3 invokes non-async-signal-safe functions, such as getpwuid(), in a forked child process before exec.
Ruby trunk fixed the cause: dlnfindexe_r() is called from a parent process, for example.
So, I'd like to know Ruby trunk actually fix this problem or not.
(I can't test because I don't have Mac.)
Anyway, I feel the change in Ruby trunk is too big to backport to Ruby 1.9.3, though.
#9 Updated by Sam Goldstein 12 months ago
This does appear to be fixed in Ruby 2.0.0-p0. I'm no longer able to get my reproduction case to hang (It doesn't hang every time in 1.9.3, but it only takes a try or two to trigger the deadlock).
I'm curious if there's any suggested workaround for this in ruby 1.9.3. I develop a library (newrelic) so I can't control the ruby version which it is run under. Generally it's okay if the backticked command fails (since we can recover from that) but deadlocking the whole process is obviously problematic.