Backport #1471

"Mutual join" deadlock detection faulty in 1.8.6 and 1.8.7

Added by John Carter over 2 years ago. Updated 9 months ago.

[ruby-core:23457]
Status:Assigned Start date:05/15/2009
Priority:Normal Due date:
Assignee:Shyouhei Urabe % Done:

100%

Category:core
Target version:-

Description

If a third thread is involved, the code in eval.c in function
rb_thread_join0 to detect a mutual join dead lock is faulty.

This effects ruby 1.8.6 p368 and ruby 1.8.7 p 160 but not ruby 1.9.1 p
129

What seems to be happening is a sequence like this....

main thread grabs mutex

thread two blocks waiting for mutex

thread one blocks waiting for mutex and th->join is set to the main thread.

the main thread unlocks, which wakes thread two. Thread two now has mutex, thread one is still blocked.

the main thread attempts to join thread one, rb_thread_join0 notes
that thread one is in a WAIT_JOIN waitfor state AND th->join still
points at the main thread and (erroneously) reports deadlock.

ruby-1.8.7-p160/eval.c================================================
rb_thread_join0(th, limit)
    rb_thread_t th;
    double limit;
{
    enum rb_thread_status last_status = THREAD_RUNNABLE;

    if (rb_thread_critical) rb_thread_deadlock();
    if (!rb_thread_dead(th)) {
        if (th == curr_thread)
            rb_raise(rb_eThreadError, "thread 0x%lx tried to join itself",
                     th->thread);
        if ((th->wait_for & WAIT_JOIN) && th->join == curr_thread)
            rb_raise(rb_eThreadError, "Thread#join: deadlock 0x%lx - mutual join(0x%lx)",
                     curr_thread->thread, th->thread);
======================================================================
Affected versions.

/opt/ruby/ruby-1.8.6-p368/bin/ruby -v
ruby 1.8.6 (2009-03-31 patchlevel 368) [i686-linux]

/opt/ruby/ruby-1.8.7-p160/bin/ruby -v
ruby 1.8.7 (2009-04-08 patchlevel 160) [i686-linux]

Here is a test script to trigger it....
=mutual_join_bug.rb============================================================
require 'thread'

m = Mutex.new
m.lock
wt2 = Thread.new do
   m.lock
   sleep 2
   m.unlock
end

# Ensure wt2 is waiting on m
sleep 0.1

wt1 = Thread.new do
   m.lock
   m.unlock
end
# Ensure wt1 is waiting on m
sleep 0.1

# Give it to wt2
m.unlock

wt1.join
======================================================================

Output...
/opt/ruby/ruby-1.8.7-p160/bin/ruby -w mutual_join_bug.rb mutual_join_bug.rb:24:in `join': Thread#join: deadlock 0xb7daa1bc - mutual join(0xb7d9bc48) (ThreadError)
        from mutual_join_bug.rb:24




The following patch "fixes" it...
======================================================================
--- eval.c_orig	2009-03-23 22:28:31.000000000 +1300
+++ eval.c	2009-05-15 15:15:34.141270568 +1200
@@ -11426,9 +11426,13 @@
 	if (th == curr_thread)
 	    rb_raise(rb_eThreadError, "thread 0x%lx tried to join itself",
 		     th->thread);
+/* This test for mutual join deadlock is faulty in that
+   it doesn't allow for the possibility that a third thread
+   may now hold the mutex.
 	if ((th->wait_for & WAIT_JOIN) && th->join == curr_thread)
 	    rb_raise(rb_eThreadError, "Thread#join: deadlock 0x%lx - mutual join(0x%lx)",
 		     curr_thread->thread, th->thread);
+*/
 	if (curr_thread->status == THREAD_TO_KILL)
 	    last_status = THREAD_TO_KILL;
 	if (limit == 0) return Qfalse;
======================================================================

But I'm not sure it is the correct fix. I'm still experience some
strangeness which I haven't pinned down yet.

I would seem to me that the correct fix is when a mutex is unlocked,
and then locked by some other thread, the th->join pointers of all
threads waiting on that mutex should be updated to point at the thread
now holding the mutex.

mutual_join_bug.rb - Test script. (261 Bytes) John Carter, 05/15/2009 05:19 pm

ruby_1_8_7_reset_waitfor_on_dead_threads.patch - Reset wait_for when joining to a dead thread & check consistency on rb_thread_set_join (1.1 kB) John Carter, 05/19/2009 02:39 pm

ruby_1_8_7_thread_adjust_join.patch - Move adjust join out of wake_on into unlock_mutex_inner (737 Bytes) John Carter, 05/19/2009 02:39 pm

Associated revisions

Revision 23432
Added by nobu over 2 years ago

* eval.c (rb_thread_join), ext/thread/thread.c (wake_one): adjusts targets of rest waiting threads to join. [ruby-core:23457]

History

Updated by Nobuyoshi Nakada over 2 years ago

  • Status changed from Open to Closed
  • % Done changed from 0 to 100
Applied in changeset r23432.

Updated by Nobuyoshi Nakada over 2 years ago

  • Category set to core
  • Status changed from Closed to Assigned
  • Assignee set to Shyouhei Urabe

Updated by Eero Saynatkari over 2 years ago

Sidenote--

Excerpts from message of Fri May 15 11:19:51 +0300 2009:
> Here is a test script to trigger it....
> =mutual_join_bug.rb============================================================
> require 'thread'
> 
> m = Mutex.new
> m.lock
> wt2 = Thread.new do
>    m.lock
>    sleep 2
>    m.unlock
> end
> 
> # Ensure wt2 is waiting on m
> sleep 0.1

The analysis looks correct, but usually you would not want
to rely solely on a sleep here, but also check the #status
before proceeding.


--
Magic is insufficiently advanced technology.

Updated by Brent Roman over 2 years ago

I don't dispute the analysis either, but...
I cannot seem to duplicate this bug in 1.6.8, 1.8.6-p287 or 1.8.7-p72

Is it specific to just the patchlevels mentioned?
Is it likely to be platform specific?
(I'm testing on x86 32-bin Linux)

When I run the mutual_join_bug script, it consistently returns without
outputting anything.

- brent


Eric Hodel-2 wrote:
> 
> Bug #1471: "Mutual join" deadlock detection faulty in 1.8.6 and 1.8.7
> http://redmine.ruby-lang.org/issues/show/1471
> 
> Author: John Carter
> Status: Open, Priority: Normal
> Category: core
> ruby -v: ruby 1.8.7 (2009-04-08 patchlevel 160) [i686-linux]
> 
> If a third thread is involved, the code in eval.c in function
> rb_thread_join0 to detect a mutual join dead lock is faulty.
> 
> This effects ruby 1.8.6 p368 and ruby 1.8.7 p 160 but not ruby 1.9.1 p
> 129
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/-ruby-core%3A23457---Bug--1471--%22Mutual-join%22-deadlock-detection-faulty-in-1.8.6-and-1.8.7-tp23555444p23571265.html
Sent from the ruby-core mailing list archive at Nabble.com.

Updated by Nobuyoshi Nakada over 2 years ago

Hi,

At Sat, 16 May 2009 16:56:23 +0900,
Brent Roman wrote in [ruby-core:23479]:
> Is it specific to just the patchlevels mentioned?

1.8.6-p364 and 1.8.7-p152.

-- 
Nobu Nakada

Updated by John Carter over 2 years ago

Thanks for the fix...

Unfortunately I think it introduces another bug.

Your patch add invokes rb_thread_set_join from adjust_join from wake_one from unlock_mutex_inner.

Alas, wake_one is _also_ invoked from signal_condvar which is invoked from signal_condvar_call from rb_condvar_signal. Thus all the join pointer of all threads waiting on that conditionvariable will be updated to point at the first thread on the condvar waiting list. Which assumes, perhaps over boldly, that the woken thread will be the one to get the lock.

Now the thread that has just been woken should have be at a ConditionVariable.wait(mutex) point, having come through rb_condvar_wait->wait_condvar->ensure(wait_list,,lock_mutex_call) which will do the right thing.

ie. The call to adjust_join should be moved out of "wake_one" and into "unlock_mutex_inner"

Edited to remove the following which I got wrong.....
 IGNORE THIS LINE->The other place that's going to invoke adjust_join, but shouldn't, is wake_all.

Updated by John Carter over 2 years ago

Here is an alternate patch for ext/thread.c relative to ruby-1.8.7-p160...
--- ext/thread/thread.c.orig	2009-05-18 17:31:50.225602484 +1200
+++ ext/thread/thread.c	2009-05-18 17:32:38.689286319 +1200
@@ -205,6 +205,16 @@
     return ary;
 }

+static void
+adjust_join(const List *list, VALUE new)
+{
+    extern void rb_thread_set_join _((VALUE, VALUE));
+    Entry *entry;
+    for (entry = list->entries; entry; entry = entry->next) {
+	rb_thread_set_join(entry->value, new);
+    }
+}
+
 static VALUE
 wake_thread(VALUE thread)
 {
@@ -471,6 +481,11 @@
     }

     waking = wake_one(&mutex->waiting);
+
+    if (!NIL_P(waking)) {
+	adjust_join(&mutex->waiting, waking);
+    }
+
     mutex->owner = waking;

     return waking;


Even with the fix I'm still getting a deadlock. I will have to look at
it somemore tomorrow.

Updated by John Carter over 2 years ago

Ok, That fix wasn't sufficient. There were still other deadlock issues, which I believe I have now fixed.

In eval.c in rb_thread_schedule we have the following code handling the case of a thread we are waiting to join that has died....

	if (th->wait_for & WAIT_JOIN) {
	    if (rb_thread_dead(th->join)) {
		th->status = THREAD_RUNNABLE;
		found = 1;
	    }

Note that we have updated the th->status, but we have left the th->wait_for unchanged! Adding...
                th->wait_for = 0;
resolved the problem.

In the new function rb_thread_set_join that Nobu added...
http://svn.ruby-lang.org/cgi-bin/viewvc.cgi/branches/ruby_1_8/eval.c?r1=23316&r2=23432
he sets th->wait_for to WAIT_JOIN.

I see this as a precondition for this function to be called. 

Attached is two patches relative to 1.8.7-p160

Updated by Nobuyoshi Nakada over 2 years ago

Hi,

At Tue, 19 May 2009 14:39:21 +0900,
John Carter wrote in [ruby-core:23500]:
> Attached is two patches relative to 1.8.7-p160

Couldn't you make a patch against the 1.8 head?

-- 
Nobu Nakada

Updated by Hans de Graaff over 2 years ago

The patches from comment #8 apply cleanly against 1.8.7-p174 and they do solve the threading issues I have with capistrano. Thanks John! 

We will be adding a revision bump of ruby 1.8.7 with these patches in Gentoo shortly.

Updated by Hans de Graaff over 2 years ago

Unfortunatey it seems I was too optimistic in my previous comment. Even with the patches I'm still seeing problems with threading when deploying with capistrano. Consequently we haven't added these to Gentoo just yet.

Bug http://redmine.ruby-lang.org/issues/show/1484 is a duplicate of this bug.

Updated by Daniel Azuma over 2 years ago

@Hans: I do not believe http://redmine.ruby-lang.org/issues/show/1484 is a duplicate of this bug. 1484 generally manifests through capistrano, and it was traced, by the capistrano people (see https://capistrano.lighthouseapp.com/projects/8716/tickets/79-capistrano-hangs-on-shell-command-for-many-computers-on-ruby-186-p368) and independently by me, to an issue with calling IO#select in multiple threads. That issue (1484) has, as far as I can tell, been resolved on the 1.8 head (see http://redmine.ruby-lang.org/issues/show/1993) and is currently assigned for backport to 1.8.7. This current issue (1471) seems to be a distinct issue. So John Carter's patches may be sufficient for this bug.

Updated by Tadashi Saito over 1 year ago

Hi all,

My friend told me that this bug seems to be still alive.
I confirmed it in above version:

  ruby 1.8.7 (2010-06-23 patchlevel 299) [i686-linux]

I checked that ruby_1_8 was modified, but it's never backported to
ruby_1_8_7.  Shyouhei, could you backport this please?

In addition, IMHO, to change bug status "Closed" was not a good
thing because this bug is opened as a bug of 1.8.7, not 1.8.

--
Tadashi Saito

Updated by Tadashi Saito over 1 year ago

Oops, I was wrong.
This is correctly "Assigned." Sorry for misunderstanding.

Updated by Hongli Lai 12 months ago

This bug is still very much alive in Ruby 1.8.7-p334. mutual_join_bug.rb still triggers the bug. Though it is fixed in 1.9.

John Carter's patches fixes the problem in mutual_join_bug.rb, but unfortunately causes a regression. Here's a test case: https://gist.github.com/836861
This test case spawns 50 threads. Each thread repeatedly downloads from microsoft.com, then grabs a lock, increases a counter and prints a progress message. If you let this script run for a few seconds, then interrupt it with Ctrl-C then sometimes it will crash with "[BUG] Internal consistency failure!". Sample session:

  $ ruby threaddownload.rb 
  80287834: Progress = 1
  808b4a68: Progress = 2
  808b35a0: Progress = 3
  808b29fc: Progress = 4
  808b1e30: Progress = 5
  808735f4: Progress = 6
  ^Cthreaddownload.rb:19: [BUG] Internal consistency failure! Expected thread to already be in waiting to join state 8, was in 0

This makes threading on Ruby 1.8 currently nearly useless. I don't have enough expertise to solve this problem unfortunately but I would like to request the core team to put more priority on this bug.

Updated by Hongli Lai 12 months ago

It appears that applying this patch instead of John Carter's patches fixes both the mututal join problem as well as the crashing problems: https://github.com/ruby/ruby/commit/072ccff04debff32072e771bc078fe8bc14ccaad

But it's only applied to the 1_8 branch. If the fix is correct I think it should be backported to 1.8.7 ASAP.

Updated by Shyouhei Urabe 12 months ago

Hmm, I see.  I will backport it.

Also available in: Atom PDF