Backport #5130

Thread.pass sticks on OpenBSD

Added by Yui NARUSE over 2 years ago. Updated over 2 years ago.

[ruby-core:38647]
Status:Closed
Priority:Normal
Assignee:Yuki Sonoda

Description

=begin
On OpenBSD 4.9, following script will stick.

./miniruby -ve'Thread.new{Thread.pass}'
=end

ruby193_thread_debug.log (3.59 KB) Jeremy Evans, 08/04/2011 05:30 AM

thread.c.diff Magnifier (552 Bytes) Jeremy Evans, 08/04/2011 07:21 AM

thread_2.diff Magnifier (1014 Bytes) Eric Wong, 08/24/2011 09:31 AM

Associated revisions

Revision 34179
Added by Motohiro KOSAKI over 2 years ago

merge revision(s) 33693:

* thread_pthread.c (gvl_yield): don't prevent concurrent sched_yield().
  [Bug #5130] 

History

#1 Updated by Motohiro KOSAKI over 2 years ago

  • Category set to core
  • Status changed from Open to Assigned
  • Assignee set to Motohiro KOSAKI

#2 Updated by Jeremy Evans over 2 years ago

Here's gdb output from attaching to a process (OpenBSD-current, not 4.9), and using ruby instead of miniruby:

$ ruby19 -e 'Thread.new{Thread.pass}'
$ gdb ruby19 17940
(gdb) bt
#0 0x0000000205b31da7 in pthreadcondbroadcast (cond=0x20b4bb040) at /usr/src/lib/libpthread/uthread/uthreadcond.c:655
#1 0x0000000204bb3709 in native
condbroadcast () from /usr/local/lib/libruby19.so.1.91
#2 0x0000000204bb4424 in rb
threadschedulelimits () from /usr/local/lib/libruby19.so.1.91
#3 0x0000000204bb476b in rbthreadschedule () from /usr/local/lib/libruby19.so.1.91
#4 0x0000000204bb4a6e in rbthreadterminateall () from /usr/local/lib/libruby19.so.1.91
#5 0x0000000204abc8da in ruby
cleanup () from /usr/local/lib/libruby19.so.1.91
#6 0x0000000204abc996 in rubyrunnode () from /usr/local/lib/libruby19.so.1.91
#7 0x0000000000400c42 in main ()
(gdb) info threads
4 process 17940, thread 0x208e0f000 (RUNNING) 0x0000000205b31da7 in pthreadcondbroadcast (cond=0x20b4bb040) at /usr/src/lib/libpthread/uthread/uthreadcond.c:655
3 process 17940, thread 0x2034be000 (SELECT
WAIT) threadkernsched (scp=0x0) at /usr/src/lib/libpthread/uthread/uthreadkern.c:495
2 process 17940, thread 0x206059000 (CONDWAIT) _threadkernsched (scp=0x0) at /usr/src/lib/libpthread/uthread/uthreadkern.c:495
1 process 17940, thread 0x20a045000 (RUNNING) threadkernsched (scp=0x0) at /usr/src/lib/libpthread/uthread/uthreadkern.c:495

I'll try to spend some time debugging this tomorrow.

#3 Updated by Motohiro KOSAKI over 2 years ago

  • Status changed from Assigned to Feedback

#4 Updated by Motohiro KOSAKI over 2 years ago

  • ruby -v changed from ruby 1.9.4dev (2011-07-31 trunk 32788) [x86_64-openbsd4.9] to -

$ ruby19 -e 'Thread.new{Thread.pass}'
$ gdb ruby19 17940
(gdb) bt
#0 x0000000205b31da7 in pthreadcondbroadcast (cond=0x20b4bb040) at /usr/src/lib/libpthread/uthread/uthread_cond.c:655

I'm surprised this stack. pthreadcondbroadcast don't have to be hang
up even if the
argument is wrong. I suspect it's OpenBSD bug.
Jeremy, Could you please handle this issue?

#1 x0000000204bb3709 in nativecondbroadcast () from /usr/local/lib/libruby19.so.1.91
#2 x0000000204bb4424 in rbthreadschedulelimits () from /usr/local/lib/libruby19.so.1.91
#3 x0000000204bb476b in rb
threadschedule () from /usr/local/lib/libruby19.so.1.91
#4 x0000000204bb4a6e in rb
threadterminateall () from /usr/local/lib/libruby19.so.1.91

#5 Updated by Jeremy Evans over 2 years ago

I built ruby with -DTHREAD_DEBUG, and am attaching a partial log for ruby19 -e 'Thread.new{Thread.pass}'. It's partial because it goes into an infinite loop at the end. I've included the first two iterations of the loop. I'll be doing more research on this issue, but if you have any suggestions, please let me know.

#6 Updated by Jeremy Evans over 2 years ago

Looking at the thread debug log I uploaded earlier, I guessed that this problem might be caused because a thread was starting after rbthreadterminateall was called. rbthreadterminateall sets vm->inhibitthreadcreation = 1, which prohibits new threads from being created. However, it doesn't prevent created-but-not-yet-started threads from starting in threadstartfunc2. My theory is that threadstartfunc2 should return without starting a thread if vm->inhibitthreadcreation = 1.

The attached patch is a crude form of what I think should happen. It fixes the Thread.new{Thread.pass} case and the related bootstrap test. It currently passes all make check tests except for one. The one test failure is:

46) Error:
testsignalrequiring(TestSignal):
EOFError: end of file reached
/usr/obj/ports/ruby-1.9.3-preview1/ruby-1.9.3-preview1/test/ruby/testsignal.rb:218:in load'
/usr/obj/ports/ruby-1.9.3-preview1/ruby-1.9.3-preview1/test/ruby/test_signal.rb:218:in
block in test
signalrequiring'
/usr/obj/ports/ruby-1.9.3-preview1/ruby-1.9.3-preview1/test/ruby/test
signal.rb:204:in popen'
/usr/obj/ports/ruby-1.9.3-preview1/ruby-1.9.3-preview1/test/ruby/test_signal.rb:204:in
testsignalrequiring'

The current patch is probably missing something small in terms of thread cleanup, but as I'm not familiar with the thread code, I'm not sure what that might be.

#7 Updated by Eric Wong over 2 years ago

Jeremy: I think your st_delete call is incorrect, can you try my updated patch?

I also rearranged rbregistersigaltstack() to build with SIGALTSTACK.

#8 Updated by Eric Wong over 2 years ago

Eric Wong normalperson@yhbt.net wrote:

File thread_2.diff added

I think both thread*.diff are racy since inhibitthreadcreation is
checked outside of GVL and I'm not sure if the idea with these
patches is correct.

Unfortunately I cannot reproduce this issue on Linux, but I would
add debug prints around loops where nativecondwait() is called
in gvlyield() and also before schedyield().

I also remember sched_yield() doesn't work on some OSes, maybe try to
change that to nanosleep()/select()?

#9 Updated by Jeremy Evans over 2 years ago

Eric Wong wrote:

Jeremy: I think your st_delete call is incorrect, can you try my updated patch?

I also rearranged rbregistersigaltstack() to build with SIGALTSTACK.

No change after replacing my patch with yours, still getting the "end of file reached" EOFError on that signal test.

#10 Updated by Jeremy Evans over 2 years ago

It's been a couple of months since this issue was last updated, and about three months since this issue was last updated by a ruby committer. Can a ruby committer look at either Eric's patch or my patch and let us know if this is an appropriate fix? I can't think of a reason why you would want to start a thread if the interpreter is shutting down, so the patch makes sense to me. Eric is right that this doesn't fix the race condition, but it should at least reduce the probability of it happening.

Looking at the thread log I sent earlier, the thread that starts after rbthreadterminateall is called calls rbthreadschedulelimits, but gvl_yield apparently never changes the thread pointer, so it always yields to itself.

My theory is that when rbthreadterminateall calls terminatei on a thread that has been created but not started, the thread never gets the terminate signal, so it continues to run indefinitely. So I think the patch makes sense. Alternatively threadstartfunc2 could check th->thrownerrinfo or th->status instead of GETVM()->inhibitthread_creation. I haven't tried that, but it could be a more general fix.

#11 Updated by Yui NARUSE over 2 years ago

  • Status changed from Feedback to Assigned
  • ruby -v changed from - to ruby 2.0.0dev (2011-11-07 trunk 33648) [x86_64-openbsd5.0]

#12 Updated by Motohiro KOSAKI over 2 years ago

First, we can't apply your nor Eric's patch because it's racy. Your patch care following scenario.

1) thread-A create thread-B
2) thread-A call terminate_all
3) thread-B start

but, it doesn't care following scenario.

1) thread-A create thread-B
2) thread-B start
(just after)
3) thread-A call terminate_all

Moreover, we shouldn't need a special care about thread starting. it should works as well.
I don't plan to apply your ugly bandaid patch.

Second, your analysis is not accurate. The complete opnebsd failed scenario is here.
(OMG, I needed to install openbsd).

thread-A thread-B

-> gvlyield
vm->gvl.wait
yield = 1;
-> schedyield
-> gvl
yield
-> condwait(switchwaitcond) # wait until waking up schedyield()
-> task-state change to CONDWAIT
<- sched
yield
vm->gvl.waityield = 0;
-> pthread
condbroadcast()
-> task-state of thread-B change to RUNNING
<- gvl
yield

-> gvlyield
vm->gvl.wait
yield = 1;
-> schedyield
<- cond
wait(switchwaitcond)

                                                   *** Oh, wait_yield is 1. try to sleep again.

                                                   -> cond_wait(switch_wait_cond)

That's all.
rbthreadterminateall() called rbthreadschedule() repeatedly. and waking-up condwait(switchwaitcond) failed
repeatedly. OpenBSD's user land pthread library makes semi deterministic scheduling. and then, the test code can't
bail out forever.

btw, schedyield() works as expected on OpenBSD. then, I don't change schedyield() call at least now.

#13 Updated by Motohiro KOSAKI over 2 years ago

  • Status changed from Assigned to Closed
  • % Done changed from 0 to 100

This issue was solved with changeset r33693.
Yui, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


  • threadpthread.c (gvlyield): don't prevent concurrent sched_yield(). [Bug #5130]

#14 Updated by Motohiro KOSAKI over 2 years ago

  • Tracker changed from Bug to Backport
  • Project changed from ruby-trunk to Backport93
  • Category deleted (core)
  • Status changed from Closed to Assigned
  • Assignee changed from Motohiro KOSAKI to Yuki Sonoda

Can anyone please review this patch?

#15 Updated by Jeremy Evans over 2 years ago

Motohiro KOSAKI wrote:

First, we can't apply your nor Eric's patch because it's racy. Your patch care following scenario.

1) thread-A create thread-B
2) thread-A call terminate_all
3) thread-B start

but, it doesn't care following scenario.

1) thread-A create thread-B
2) thread-B start
(just after)
3) thread-A call terminate_all

True. As my earlier message says, the patch only reduces the chance losing the race.

Moreover, we shouldn't need a special care about thread starting. it should works as well.
I don't plan to apply your ugly bandaid patch.

Fair enough. I certainly agree that my patch is an ugly bandaid.

Second, your analysis is not accurate. The complete opnebsd failed scenario is here.
(OMG, I needed to install openbsd).

thread-A thread-B

-> gvlyield
vm->gvl.wait
yield = 1;
-> schedyield
-> gvl
yield
-> condwait(switchwaitcond) # wait until waking up schedyield()
-> task-state change to CONDWAIT
<- sched
yield
vm->gvl.waityield = 0;
-> pthread
condbroadcast()
-> task-state of thread-B change to RUNNING
<- gvl
yield

-> gvlyield
vm->gvl.wait
yield = 1;
-> schedyield
<- cond
wait(switchwaitcond)

                                                   *** Oh, wait_yield is 1. try to sleep again.

                                                   -> cond_wait(switch_wait_cond)

That's all.
rbthreadterminateall() called rbthreadschedule() repeatedly. and waking-up condwait(switchwaitcond) failed
repeatedly. OpenBSD's user land pthread library makes semi deterministic scheduling. and then, the test code can't
bail out forever.

So the underlying problem is that wait_yield is always set to 1 when thread 2 wakes up, which causes thread 2 to sleep again? Is that a bug in ruby or OpenBSD?

btw, schedyield() works as expected on OpenBSD. then, I don't change schedyield() call at least now.

That's good.

Thank you for taking the time to look into the issue in more detail. If I can be of any help, please let me know.

#16 Updated by Jeremy Evans over 2 years ago

Motohiro KOSAKI wrote:

Can anyone please review this patch?

I'm not qualified to determine correctness, but I can confirm that it fixes the threading bootstrap test failure on OpenBSD amd64 and i386.

#17 Updated by Motohiro KOSAKI over 2 years ago

So the underlying problem is that wait_yield is always set to 1 when thread 2 wakes up, which causes thread 2 to sleep again? Is that a bug
in ruby or OpenBSD?

Difficult question. OpenBSD has a posix compliance pthread implementation. Ruby has reasonable OS assumption. But.... the result was pretty nasty unfortunate thing. Aghh.
Anyway, I don't think OpenBSD need to fix anything.

Thank you.

#18 Updated by Motohiro KOSAKI over 2 years ago

  • Status changed from Assigned to Closed

This issue was solved with changeset r34179.
Yui, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


merge revision(s) 33693:

* thread_pthread.c (gvl_yield): don't prevent concurrent sched_yield().
  [Bug #5130] 

Also available in: Atom PDF