Bug #1525

Deadlock in Ruby 1.9's VM caused by ConditionVariable.wait and fork?

Added by hongli (Hongli Lai) almost 3 years ago. Updated about 1 year ago.

[ruby-core:23572]
Status: Closed
Start date: 05/28/2009
Priority: Normal
Due date:
Assignee: ko1 (Koichi Sasada)
% Done: 100%
Category: -
Target version: 1.9.1
ruby -v: ruby 1.9.1p129 (2009-05-12 revision 23412) [i386-darwin9.6.0]

Description

The following code seems to cause a VM-wide deadlock on 1.9:

  require 'thread'

  lock = Mutex.new
  cond = ConditionVariable.new
  t = Thread.new do
    lock.synchronize do
      cond.wait(lock)
    end
  end

  pid = fork do
    # Child
    STDOUT.write "This is the child process.\n"
    STDOUT.write "Child process exiting.\n"
  end
  STDOUT.write("Child PID = #{pid}\n")
  Process.waitpid(pid)



The expected output is:

  Child PID = xxxx
  This is the child process.
  Child process exiting.

After the exit message, Ruby should exit.

Instead, Ruby 1.9 gives:

  Child PID = 15493
  This is the child process.
  (process hangs here)

Ruby 1.8 does not suffer from this problem.

Upon debugging Ruby, I've found that Ruby is stuck in blocking_region_end(), at the following line:

  native_mutex_lock(&th->vm->global_vm_lock);

blocking_region_end() was called as part of rb_write_internal(), right after writing "This is the child process.\n" to stdout. The problem occurs only if a background thread is waiting on a ConditionVariable; if you remove the thread, the deadlock does not occur.

vm_deadlock_fix.diff (466 Bytes) hongli (Hongli Lai), 11/17/2009 09:58 pm


Related issues

related to ruby-trunk - Bug #2025: problem with pthread handling on non NPTL platform Closed 08/31/2009

Associated revisions

Revision 25844
Added by nobu (Nobuyoshi Nakada) over 2 years ago

* thread.c (rb_thread_atfork_internal): reinitialize global lock at fork to get rid of deadlock. based on the patch from Hongli Lai in [ruby-core:26783]. [ruby-core:23572]

History

Updated by hongli (Hongli Lai) almost 3 years ago

It appears that this bug is OS X-specific. On Ubuntu 8.04 it behaves correctly: ruby 1.9.1p129 (2009-05-12 revision 23412) [x86_64-linux]

Updated by yugui (Yuki Sonoda) almost 3 years ago

  • Assignee set to ko1 (Koichi Sasada)
  • Target version set to 1.9.1

Updated by vanjab (Vanja Bucic) almost 3 years ago

Just to chime in on this issue.

It is affecting our company as well. We run our software on Apple machines in a server environment.
Our application is a multithreaded daemon process that accesses a MySQL database fairly often. Some of these threads fork off a task that may take some time to complete. This worked well prior to Ruby 1.9. Since then we have tried to find a workaround, but to no avail. Any attempt to use fork in our application results in a deadlock about 8 times out of 10.
It deadlocks in the weirdest places, like 'puts' or mysql.query (which we know takes a global lock internally, but should be thread-safe).

Any ideas and attempts to resolve this ASAP are welcome.

Updated by vanjab (Vanja Bucic) almost 3 years ago

To add my test case:
# ------------------------
require 'thread'

$stderr.puts RUBY_VERSION

$pid = 0
$t1 = Thread.new do
  $pid = fork {
    sleep(1)
    $stderr.puts "thread 1 exiting"
    exit
  }
end

$stdout.puts "ok, thread with fork spawned"
$stdout.puts "la la la la"

Process.waitpid($pid)

$stderr.puts "never done"

# ------ outputs ----------
1.9.2
ok, thread with fork spawned
la la la la

Updated by vanjab (Vanja Bucic) almost 3 years ago

In Reply to:
-- IMHO, from the perspective of the user, although it is hard, try not to use
-- any non-async-signal-safe functions in a forked child process before any
-- exec functions are called.

-- - Tetsu

Just so I can understand the logic, could you rewrite my test case above so that it does not deadlock?
It is not clear to me which of the functions I used in the test case are non-async-signal-safe.

Thanks.
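
One way to read the quoted advice is to run no Ruby block in a forked copy of the parent at all, and to start a fresh interpreter for the child's work instead. The following is only a minimal sketch of that idea, not a rewrite supplied by anyone in this thread. The helper name run_in_fresh_interpreter is hypothetical, RbConfig.ruby and Process.spawn are assumed to be available (they are not on 1.8), and whether this avoids the deadlock on the affected platforms was not verified here.

```ruby
require 'rbconfig'

# Hypothetical helper (not from the thread): run a snippet of Ruby in a
# brand-new interpreter instead of in a forked copy of this process.
def run_in_fresh_interpreter(code)
  # Process.spawn starts the given program directly; no Ruby block of ours
  # runs in a forked copy of the parent.
  pid = Process.spawn(RbConfig.ruby, "-e", code)
  _, status = Process.wait2(pid)
  status
end

status = run_in_fresh_interpreter('sleep(0.1); $stderr.puts "child exiting"')
$stdout.puts "child exited with status #{status.exitstatus}"
```

The trade-off is that the child no longer shares the parent's loaded code or objects, so anything it needs has to be passed through arguments, the environment, or a pipe.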

Updated by matz (Yukihiro Matsumoto) almost 3 years ago

Hi,

In message "Re: [ruby-core:24565] Re: [Bug #1525] Deadlock in Ruby 1.9's VM caused by ConditionVariable.wait and fork?"
    on Sun, 26 Jul 2009 22:11:41 +0900, Hongli Lai <hongli@plan99.net> writes:

|In any case, not being able to create threads or doing anything
|complicated in child processes is a serious limitation. This makes
|forking-without-exec in Ruby 1.9 as good as useless. Even
|forking-with-exec is dangerous now. For example, suppose that the child
|process creates a command string to pass to exec(), and creating this
|command string involves malloc()ing memory. Even this isn't safe anymore.
|
|I think Kernel#fork should be made safe as much as possible.

I know what you mean.  But we cannot override the underlying platform
behavior (i.e., it is impossible).  If it were possible, we would be glad
to adopt it.

# But in Vanja's case, it might be possible to support this by adjusting
# the timing of launching the internal worker thread.  I am not sure yet.

							matz.

Updated by normalperson (Eric Wong) almost 3 years ago

Looking at trunk, there doesn't seem to be any accounting of mutexes to pass to
handlers for pthread_atfork; so the child process will just inherit the mutexes
in an unknown state.

It should be possible to fix the problem by keeping track of all mutexes as
they're created/initialized and registering pthread_atfork handlers
to ensure all mutexes are unlocked when the child starts running.

I'm pretty sure forking in the presence of threads in the parent will always
require a GVL, but I don't think it's too big of an issue otherwise.
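
The bookkeeping idea above can be illustrated at the Ruby level. This is only an analogy: the mutexes in question are C-level pthread mutexes inside the interpreter and libc, and Ruby has no direct pthread_atfork hook. All names below (ForkSafeLocks, Slot, new_lock, reset_all!) are hypothetical and exist only for this sketch. Every lock is registered when it is created, and a fork wrapper replaces all registered locks with fresh, unlocked ones before the child runs any other code.

```ruby
require 'thread'

# Hypothetical registry; none of these names exist in Ruby itself.
module ForkSafeLocks
  @registry_lock = Mutex.new
  @slots = []

  # A slot holds a Mutex that can be swapped for a fresh one.
  class Slot
    def initialize
      @mutex = Mutex.new
    end

    def synchronize(&block)
      @mutex.synchronize(&block)
    end

    def locked?
      @mutex.locked?
    end

    # Discard whatever state was inherited and start from an unlocked mutex.
    def reset!
      @mutex = Mutex.new
    end
  end

  # Create a lock and remember it so it can be reset later.
  def self.new_lock
    slot = Slot.new
    @registry_lock.synchronize { @slots << slot }
    slot
  end

  # Plays the role of a pthread_atfork child handler in this analogy.
  def self.reset_all!
    @registry_lock = Mutex.new
    @slots.each(&:reset!)
  end

  # Fork wrapper: the child resets every registered lock before running
  # the caller's block.
  def self.fork
    Process.fork do
      reset_all!
      yield
    end
  end
end
```

A caller would create locks with ForkSafeLocks.new_lock and fork with ForkSafeLocks.fork { ... }. Locks created any other way are invisible to the registry, which is the practical difficulty with tracking every mutex.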

Updated by normalperson (Eric Wong) almost 3 years ago

"none <" <tetsu.soh.dev@gmail.com> wrote:
> Eric Wong wrote:
>> It should be possible to fix the problem by keeping track of all mutexes as
>> they're created/initialized and registering pthread_atfork handlers
>> to ensure all mutexes are unlocked when the child starts running.
>>
> In fact, it is impossible to track all mutexes because the usage of
> mutexes really depends on the underlying implementation.
> For example, the deadlock in this issue doesn't happen on Linux,
> and not even on FreeBSD 7.2, but it does happen on FreeBSD 6.4.

Yes, it's not easy; but I think we can start making a best effort and
wait for OSes to catch up.  This lets us start paving the way towards
reducing the reliance on the GVL:

The big system-side offenders are stdio, malloc and resolver...

1. Ruby 1.9 already removed most of stdio dependencies.

2. malloc still happens under a GVL, but I think replacing it with a
Ruby-aware memory allocator that's better integrated with the GC and
thread management would be a good thing anyways.

3. Maybe look at c-ares or even resolv.rb since they'd play nicer
with timeouts anyways... (not too sure on this one).

There's probably a few other things, but I think those are the main
ones that server applications (the ones most likely to use threads+fork)
will care about...

-- 
Eric Wong

Updated by hongli (Hongli Lai) over 2 years ago

The attached patch fixes the problem. Before forking there might be an arbitrary number of threads waiting on the lock, causing it to enter an undefined state after forking, which in turn causes a deadlock on some platforms. This patch reinitializes the global interpreter lock right after forking, which should be safe because all threads are gone right after forking.

I tried this before and it didn't work, but I suspect that that was caused by bug #2371. Now that #2371 has been fixed it would seem that this patch works.
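
A regression check for this behavior could look roughly like the sketch below. It is not part of the attached patch. The helper name fork_with_waiter_completes? and the 5-second timeout are assumptions, and the child writes to a pipe rather than to STDOUT so that the check stays quiet. It reproduces the reported setup (a background thread blocked in ConditionVariable#wait, then a fork whose child performs writes) and reports whether the child exits in time.

```ruby
require 'thread'
require 'timeout'

# Hypothetical helper (name and timeout are assumptions, not part of the patch).
# Returns true if a child forked while another thread waits on a
# ConditionVariable manages to write to a pipe and exit within `seconds`.
def fork_with_waiter_completes?(seconds)
  lock = Mutex.new
  cond = ConditionVariable.new
  waiter = Thread.new { lock.synchronize { cond.wait(lock) } }
  # Let the background thread reach cond.wait before forking.
  sleep 0.01 until waiter.status == "sleep"

  reader, writer = IO.pipe
  pid = fork do
    reader.close
    writer.write "This is the child process.\n"
    writer.write "Child process exiting.\n"
    exit!(0)
  end
  writer.close

  begin
    Timeout.timeout(seconds) { Process.waitpid(pid) }
    true
  rescue Timeout::Error
    # The child is presumably stuck; kill and reap it.
    Process.kill("KILL", pid)
    Process.waitpid(pid)
    false
  ensure
    reader.close
    waiter.kill
    waiter.join
  end
end

puts fork_with_waiter_completes?(5) ? "child exited" : "child hung"
```

On an interpreter without the fix, on an affected platform, the expectation is that the helper returns false after the timeout instead of hanging the parent.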

Updated by vanjab (Vanja Bucic) over 2 years ago

Very good news, thanks.  Where do I fetch the latest sources that include your patch so that I can test with our use case?
Thanks.

Updated by hongli (Hongli Lai) over 2 years ago

The patch is to be applied on top of Ruby's SVN sources.

Updated by nobu (Nobuyoshi Nakada) over 2 years ago

  • Status changed from Open to Closed
  • % Done changed from 0 to 100
This issue was solved with changeset r25844.
Hongli, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.

Updated by daniel (Daniel Cavanagh) over 2 years ago

On 25/11/2009, at 5:57 PM, Tanaka Akira wrote:

> In article <4B07C4C5.8060102@plan99.net>,
>  Hongli Lai <hongli@plan99.net> writes:
> 
>>> % ./ruby -e 'fork { puts }'
>>> -e:1: [BUG] native_mutex_unlock return non-zero: 1
>>> ruby 1.9.2dev (2009-11-19 trunk 25848) [x86_64-freebsd6.4]
>> 
>> This is what I get on FreeBSD 7.1-RELEASE:
> 
> FreeBSD 8.0-RELEASE behaves similar to FreeBSD 6.4.
> 
> % uname -mrsv
> FreeBSD 8.0-RELEASE FreeBSD 8.0-RELEASE #0: Sat Nov 21 15:48:17 UTC 2009     root@almeida.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  i386
> % ./ruby -e 'fork { puts }'
> -e:1: [BUG] pthread_mutex_unlock: Operation not permitted (EPERM)
> ruby 1.9.2dev (2009-11-25 trunk 25911) [i386-freebsd8.0]
> 
> -- control frame ----------
> c:0009 p:---- s:0020 b:0020 l:000019 d:000019 CFUNC  :write
> c:0008 p:---- s:0018 b:0018 l:000017 d:000017 CFUNC  :puts
> c:0007 p:---- s:0016 b:0016 l:000015 d:000015 CFUNC  :puts
> c:0006 p:0009 s:0013 b:0013 l:0010a4 d:000012 BLOCK  -e:1
> c:0005 p:---- s:0011 b:0011 l:000010 d:000010 FINISH
> c:0004 p:---- s:0009 b:0009 l:000008 d:000008 CFUNC  :fork
> c:0003 p:0009 s:0006 b:0006 l:0010a4 d:000004 EVAL   -e:1
> c:0002 p:---- s:0004 b:0004 l:000003 d:000003 FINISH
> c:0001 p:0000 s:0002 b:0002 l:0010a4 d:0010a4 TOP   
> ---------------------------
> -e:1:in `<main>'
> -e:1:in `fork'
> -e:1:in `block in <main>'
> -e:1:in `puts'
> -e:1:in `puts'
> -e:1:in `write'

Exactly the same thing happens on NetBSD 5.0.1, if that helps.
