Project

General

Profile

Actions

Bug #6278

closed

in `join': deadlock detected (fatal)

Added by kurt@intricatesoftware.com (Kurt Miller) over 12 years ago. Updated over 12 years ago.

Status:
Closed
Target version:
-
ruby -v:
ruby 1.9.3p125 (2012-02-16 revision 34643) [i386-openbsd]
Backport:
[ruby-core:44275]

Description

There is a race condition between native_sleep(), thread_start_func_2() where thread_start_func_2() holds the gvl lock, then in this bit of code:

514 /* wake up joining threads */
515 join_th = th->join_list_head;
516 while (join_th) {
517 if (join_th == main_th) errinfo = Qnil;
518 rb_threadptr_interrupt(join_th);
519 switch (join_th->status) {
520 case THREAD_STOPPED: case THREAD_STOPPED_FOREVER:
521 printf("%p %d\n", join_th, join_th->status);
522 join_th->status = THREAD_RUNNABLE;
523 default: break;
524 }
525 join_th = join_th->join_list_next;
526 }
527
528 rb_threadptr_unlock_all_locking_mutexes(th);
529 if (th != main_th) rb_check_deadlock(th->vm);

The other thread in native_sleep() is interrupted via the rb_threadptr_interrupt(join_th) call above.
Then the join_th->status is set to THREAD_RUNNABLE above.
The other thread blocks trying to get the gvl lock in native_sleep():

914 GVL_UNLOCK_END();

Then the thread in thread_start_func_2() calls rb_check_deadlock():

529 if (th != main_th) rb_check_deadlock(th->vm);

which thinks the thread in native_sleep is deadlocked since it is not in the correct state.

Here are the backtraces for the two threads, I added abort() in check_deadlock_i() when it decides there is a deadlock for a thread:

(gdb) thread 1
[Switching to thread 1 (process 930810973)]#0 0x0d9b248d in kill () from /usr/lib/libc.so.63.0
(gdb) bt
#0 0x0d9b248d in kill () from /usr/lib/libc.so.63.0
#1 0x0da1c925 in abort () at /usr/src/lib/libc/stdlib/abort.c:68
#2 0x0ddd5095 in check_deadlock_i (key=2081393040, val=695888864, found=0x7d300f38) at thread.c:4738
#3 0x0dd6b9ce in st_foreach (table=0x7c162d20, func=0xddd503f <check_deadlock_i>, arg=2100301624)
at st.c:787
#4 0x0ddd526a in rb_check_deadlock (vm=0x7c034a00) at thread.c:4785
#5 0x0ddce8d1 in thread_start_func_2 (th=0x7c64be00, stack_start=0x7d301000) at thread.c:529
#6 0x0ddcd51e in thread_start_func_1 (th_ptr=0x7c64be00) at thread_pthread.c:653
#7 0x097a948e in _rthread_start (v=0x7c60f800) at rthread.c:113
#8 0x0d99ae81 in __tfork_thread () from /usr/lib/libc.so.63.0
(gdb) frame 5
#5 0x0ddce8d1 in thread_start_func_2 (th=0x7c64be00, stack_start=0x7d301000) at thread.c:529
529 if (th != main_th) rb_check_deadlock(th->vm);

(gdb) thread 2
[Switching to thread 2 (process 207424605)]#0 0x0d9a9715 in _thread_sys___thrsleep ()
(gdb) bt
#0 0x0d9a9715 in _thread_sys___thrsleep () from /usr/lib/libc.so.63.0
#1 0x097a7190 in pthread_cond_wait (condp=0x7c034a10, mutexp=0x7c034a08) at rthread_sync.c:473
#2 0x0ddcd0b4 in native_cond_wait (cond=0x7c034a10, mutex=0x7c034a08) at thread_pthread.c:304
#3 0x0ddccb15 in gvl_acquire_common (vm=0x7c034a00) at thread_pthread.c:61
#4 0x0ddccb7f in gvl_acquire (vm=0x7c034a00, th=0x7c034200) at thread_pthread.c:79
#5 0x0ddcd848 in native_sleep (th=0x7c034200, timeout_tv=0x0) at thread_pthread.c:914
#6 0x0ddcf2ea in sleep_forever (th=0x7c034200, deadlockable=1) at thread.c:856
#7 0x0ddcef97 in thread_join_sleep (arg=3485240628) at thread.c:689
#8 0x0dcbdda7 in rb_ensure (b_proc=0xddcef52 <thread_join_sleep>, data1=3485240628,
e_proc=0xddceee6 <remove_from_join_list>, data2=3485240628) at eval.c:744
#9 0x0ddcf096 in thread_join (target_th=0x7c393600, delay=1e+30) at thread.c:722
#10 0x0ddcf191 in thread_join_m (argc=0, argv=0x7c05003c, self=2087355560) at thread.c:803
#11 0x0ddb7125 in call_cfunc (func=0xddcf113 <thread_join_m>, recv=2087355560, len=Variable "len" is not available.
) at vm_insnhelper.c:317
#12 0x0ddbf1bc in vm_call_method (th=0x7c034200, cfp=0x7c0cfef8, num=0, blockptr=0x1, flag=0, id=4736,
me=0x7c166220, recv=2087355560) at vm_insnhelper.c:404
#13 0x0ddc2a60 in vm_exec_core (th=0x7c034200, initial=Variable "initial" is not available.
) at insns.def:1015
#14 0x0ddc9401 in vm_exec (th=0x7c034200) at vm.c:1220
#15 0x0ddc9bc0 in invoke_block_from_c (th=0x7c034200, block=0x7c0cff90, self=2081428100, argc=1,
argv=0xcfbc9700, blockptr=0x0, cref=0x0) at vm.c:624
#16 0x0ddca167 in rb_yield (val=2087355560) at vm.c:654
#17 0x0dc8cad2 in rb_ary_each (array=2087355600) at array.c:1478
#18 0x0ddb710c in call_cfunc (func=0xdc8ca80 <rb_ary_each>, recv=2087355600, len=Variable "len" is not available.
) at vm_insnhelper.c:320
#19 0x0ddbf1bc in vm_call_method (th=0x7c034200, cfp=0x7c0cff7c, num=0, blockptr=0x7c0cff91, flag=0,
id=424, me=0x7c135940, recv=2087355600) at vm_insnhelper.c:404
#20 0x0ddc2a60 in vm_exec_core (th=0x7c034200, initial=Variable "initial" is not available.
) at insns.def:1015
#21 0x0ddc9401 in vm_exec (th=0x7c034200) at vm.c:1220
#22 0x0ddc95b7 in rb_iseq_eval_main (iseqval=2082861760) at vm.c:1461
#23 0x0dcbea29 in ruby_exec_internal (n=0x7c25f2c0) at eval.c:204
#24 0x0dcbea55 in ruby_exec_node (n=0x7c25f2c0) at eval.c:251
#25 0x0dcbfe2e in ruby_run_node (n=0x7c25f2c0) at eval.c:244
#26 0x1c000984 in main (argc=4, argv=0xcfbca128) at main.c:38
(gdb) frame 5
#5 0x0ddcd848 in native_sleep (th=0x7c034200, timeout_tv=0x0) at thread_pthread.c:914
914 GVL_UNLOCK_END();

$ ruby19 -v
ruby 1.9.3p125 (2012-02-16 revision 34643) [i386-openbsd]

To reproduce install ruby-sequel from http://sequel.rubyforge.org/

$ cat cpu_bench.rb
require 'sequel'

def a
1000.times{|j| Sequel.connect(:adapter=>:mock){}}
end

(0..10).map{Thread.new{a}}.each{|t| t.join}

$ cat doit
until [ $? -ne 0 ]; do
ruby19 -I sequel/lib cpu_bench.rb
done
echo $?


Files

deadlock_detection_fix.patch (1.49 KB) deadlock_detection_fix.patch thedarkone (Vit Z), 08/04/2012 10:53 PM

Updated by ko1 (Koichi Sasada) over 12 years ago

  • Assignee set to ko1 (Koichi Sasada)

Thank you for your analysis.
Because I can't understand this situation yet, I'll check it later.

Can I reproduce this issue without 'sequel'?

Is it difficult?

Updated by jeremyevans0 (Jeremy Evans) over 12 years ago

I'm the one who originally came across this issue and wrote the program that triggers this bug (the poorly-named cpu_bench.rb). I tried numerous times to reproduce it without Sequel, but I was unable to. Sequel is pure-ruby code (no C extensions), so I don't think that this type of bug can be caused by Sequel itself. With that program, it's fairly easy to hit the bug (I just tried and it happened 5/5 times).

Updated by mame (Yusuke Endoh) over 12 years ago

  • Status changed from Open to Assigned

Updated by thedarkone (Vit Z) over 12 years ago

=begin
I've tracked this down to a spinlock/deadlock-detection race condition.

Here's a much shorter script to reproduce:

require 'thread'

class SyncHash < Hash
  WRITE_LOCK = Mutex.new
  
  def put_if_absent(key, value)
    unless self[key]
      WRITE_LOCK.synchronize do
        self[key] = value
      end
    end
  end
end

keys = []
250_000.times {|i| keys << i}
thread_count = 20
10.times do |i|
  h = SyncHash.new

  (1..thread_count).map do
    Thread.new do
      my_keys = keys.shuffle
      my_keys.each {|k| h.put_if_absent(k, k)}
    end
  end.map(&:join)

  puts "---#{i}---"
end

Typical run:
$ ruby deadlock.rb
---0---
---1---
---2---
deadlock.rb:21:in join': deadlock detected (fatal) from deadlock.rb:21:in map'
from deadlock.rb:21:in block in <main>' from deadlock.rb:18:in times'
from deadlock.rb:18:in `'

What happens is that a thread waiting for a lock times out from (({native_cond_timedwait})), decrements (({mutex->cond_waiting})) to (({0})) and blocks on (({GVL_UNLOCK_END();})) waiting to acquire the GVL. Meanwhile some other thread (while holding GVL) releases the lock ((({mutex->th = 0}))), calls (({rb_check_deadlock})) and declares a deadlock.

I'm attaching a proposed patch, that adds a (({spinlock_waiting_gvl})) flag to the thread: (({deadlock_detection_fix.patch})).
=end

Updated by kosaki (Motohiro KOSAKI) over 12 years ago

Good catch!

For the record, I'd like to write a detail race. This race need three threads at least.

  1. thread-A call rb_mutex_lock() and got mutex.
  2. thread-B call rb_mutex_lock() and became sleeping in native_cond_wait().
  3. thread-A call rb_mutex_unlock() and call native_cond_signal().
  4. thread-A call rb_mutex_lock() and became sleeping in native_cond_wait().
  5. thread-C call rb_mutex_lock() and became sleeping in native_cond_timedwait(). i'm last one.
  6. thread-B wake up by (3).
  7. thread-B call rb_mutex_unlock() and call native_cond_signal().
  8. thread-B call rb_mutex_lock() and became sleeping in native_cond_timedwait(). i'm last one too.
  9. thread-A and thread-B wake up. and try to get GVL at the same time. (btw, any thread combinations can make this race)

Now, check_deadlock() from thread-A detect thread-B don't have a mutex and mutex->cond_waiting==0.
However, the attached patch doesn't work correctly. When unlocking GVL, we can't touch th object. It's a part of ruby world.
How is this?

diff --git a/thread.c b/thread.c
index d68fead..d0074a2 100644
--- a/thread.c
+++ b/thread.c
@@ -3913,6 +3913,13 @@ lock_interrupt(void *ptr)
}

/*

    • At maximum, only one thread can use cond_timedwait and watch deadlock
    • periodically. Multiple patrol thread (e.g. concurrent deadlock check)
    • introduces new race conditions.
  • */
    +rb_thread_t *patrol_thread = NULL;

+/*

  • call-seq:
  • mutex.lock -> self

@@ -3949,14 +3956,20 @@ rb_mutex_lock(VALUE self)
* vm->sleepr is unstable value. we have to avoid both deadlock
* and busy loop.
*/

  •       if (vm_living_thread_num(th->vm) == th->vm->sleeper) {
    
  •       if ((vm_living_thread_num(th->vm) == th->vm->sleeper) &&
    
  •           !patrol_thread) {
              timeout_ms = 100;
    
  •           patrol_thread = th;
          }
    
  •       GVL_UNLOCK_BEGIN();
          interrupted = lock_func(th, mutex, timeout_ms);
          native_mutex_unlock(&mutex->lock);
          GVL_UNLOCK_END();
    
  •       if (patrol_thread == th)
    
  •           patrol_thread = NULL;
    
  •       reset_unblock_function(th, &oldubf);
    
          th->locking_mutex = Qfalse;
    

@@ -5199,6 +5212,7 @@ rb_check_deadlock(rb_vm_t *vm)

 if (vm_living_thread_num(vm) > vm->sleeper) return;
 if (vm_living_thread_num(vm) < vm->sleeper) rb_bug("sleeper must not be more than vm_living_threa
  • if (patrol_thread && patrol_thread != GET_THREAD()) return;

Updated by thedarkone (Vit Z) over 12 years ago

@kosaki's solution seems to be much better than mine.

+1

Actions #7

Updated by kosaki (Motohiro KOSAKI) over 12 years ago

  • Status changed from Assigned to Closed
  • % Done changed from 0 to 100

This issue was solved with changeset r36926.
Kurt, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


  • thread.c (rb_mutex_lock): stop multiple threads use
    pthread_cond_timedwait() concurrently. [Bug #6278] [ruby-core:44275]
Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0Like0Like0Like0