Bug #13794
Infinite loop of sched_yield (Closed)
Description
I have been encountering an issue with processes hanging in an infinite loop of calling sched_yield(). The looping code can be found at https://github.com/ruby/ruby/blob/v2_3_4/thread_pthread.c#L1663
while (ATOMIC_CAS(timer_thread_pipe.writing, (rb_atomic_t)0, 0)) {
native_thread_yield();
}
It is my belief that by some mechanism I have not been able to identify, timer_thread_pipe.writing is incremented but is never decremented, causing this loop to run infinitely.
I am not able to create a reproducible test case; however, this issue occurs regularly in my production application. I have attached backtraces and thread lists from 2 processes exhibiting this behaviour. gdb confirms that timer_thread_pipe.writing = 1 in these processes.
I believe one possibility of the cause is that rb_thread_wakeup_timer_thread() or rb_thread_wakeup_timer_thread_low() is called, and before it returns, another thread calls fork(), leaving the value of timer_thread_pipe.writing incremented, but leaving behind the thread that would normally decrement it.
If this is correct, one solution would be to reset timer_thread_pipe.writing to 0 in native_reset_timer_thread() immediately after a fork.
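To illustrate the suspected mechanism outside of Ruby, here is a minimal standalone C sketch (all names, timings, and the helper thread are invented purely for illustration): one thread raises a counter around a simulated pipe write, the main thread calls fork() inside that window, and the child inherits the raised counter with no thread left to lower it.

/* Standalone illustration of the suspected race; not Ruby source. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static atomic_int writing;               /* stands in for timer_thread_pipe.writing */

static void *waker(void *arg)
{
    (void)arg;
    for (;;) {
        atomic_fetch_add(&writing, 1);   /* "about to write to the pipe" */
        usleep(1000);                    /* window in which fork() may land */
        atomic_fetch_sub(&writing, 1);   /* never executed in the child */
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, waker, NULL);
    usleep(500);                         /* try to fork inside the window */
    pid_t pid = fork();
    if (pid == 0) {
        /* the waker thread does not exist in the child, so a non-zero value
         * here would never be decremented again, which is what the spin loop
         * above then waits on forever */
        printf("child sees writing = %d\n", atomic_load(&writing));
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    return 0;
}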
Other examples of similar bugs being reported:
https://github.com/resque/resque/issues/578
https://github.com/zk-ruby/zk/issues/50
Updated by normalperson (Eric Wong) over 7 years ago
Thanks for the report. I'll take a look at it in a few hours.
Updated by normalperson (Eric Wong) over 7 years ago
charlie@atech.media wrote:
Bug #13794: Infinite loop of sched_yield
https://bugs.ruby-lang.org/issues/13794

I have been encountering an issue with processes hanging in an infinite loop of calling sched_yield(). The looping code can be found at https://github.com/ruby/ruby/blob/v2_3_4/thread_pthread.c#L1663
while (ATOMIC_CAS(timer_thread_pipe.writing, (rb_atomic_t)0, 0)) {
native_thread_yield();
}

It is my belief that by some mechanism I have not been able to identify, timer_thread_pipe.writing is incremented but is never decremented, causing this loop to run infinitely.
I am not able to create a reproducible test case, however this issue occurs regularly in my production application. I have attached backtraces and thread lists from 2 processes exhibiting this behaviour. gdb confirms that timer_thread_pipe.writing = 1 in these processes.
Can you also check the value of timer_thread_pipe.owner_process?
I believe one possibility of the cause is that rb_thread_wakeup_timer_thread() or rb_thread_wakeup_timer_thread_low() is called, and before it returns, another thread calls fork(), leaving the value of timer_thread_pipe.writing incremented, but leaving behind the thread that would normally decrement it.
That is a likely possibility.
If this is correct, one solution would be to reset timer_thread_pipe.writing to 0 in native_reset_timer_thread() immediately after a fork.
How about checking owner_process before incrementing?
Can you try the following patch to check owner_process?
https://80x24.org/spew/20170809232533.14932-1-e@80x24.org/raw
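As a rough sketch of the idea only (not the patch text; the helper name rb_thread_wakeup_timer_thread_fd and the pipe field used here are illustrative), the wakeup path would bail out when the current PID no longer matches the recorded owner, so a forked child never touches .writing:

/* Sketch only, not the actual patch. */
void
rb_thread_wakeup_timer_thread(void)
{
    if (timer_thread_pipe.owner_process != getpid())
        return; /* forked child: the pipe and timer thread belong to the parent */

    ATOMIC_INC(timer_thread_pipe.writing);
    rb_thread_wakeup_timer_thread_fd(timer_thread_pipe.normal[1]);
    ATOMIC_DEC(timer_thread_pipe.writing);
}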
timer_thread_pipe.writing was introduced in August 2015 with r51576,
so this bug would definitely be my fault.
Other examples of similar bugs being reported:
https://github.com/resque/resque/issues/578
https://github.com/zk-ruby/zk/issues/50
That also means these bugs from 2012 are from other causes.
Thanks again for this report.
Updated by normalperson (Eric Wong) over 7 years ago
Eric Wong wrote:
https://80x24.org/spew/20170809232533.14932-1-e@80x24.org/raw
Also, a disclaimer: I've been far from my usual self lately.
Everything about my proposed patch could be wrong and bogus, too.
Extra eyes are, as always, very welcome.
Updated by catphish (Charlie Smurthwaite) over 7 years ago
Can you also check the value of timer_thread_pipe.owner_process?
I don't have any broken processes available right now, but I will check as soon as I can.
How about checking owner_process before incrementing?
I'm afraid this fix doesn't quite match up in my mind. To clarify, I am suggesting that timer_thread_pipe.writing is being incremented in the parent process before the fork occurs. This would still occur because the PID would match at that point.
Updated by catphish (Charlie Smurthwaite) over 7 years ago
How about checking owner_process before incrementing?
I'm afraid this fix doesn't quite match up in my mind. To clarify, I am suggesting that timer_thread_pipe.writing is being incremented in the parent process before the fork occurs. This would still occur because the PID would match at that point.
Checking it before the while loop might work though.
Updated by catphish (Charlie Smurthwaite) over 7 years ago
Can you also check the value of timer_thread_pipe.owner_process?
(gdb) print timer_thread_pipe.writing
$1 = 1
(gdb) print timer_thread_pipe.owner_process
$2 = 0
(gdb) info threads
Id Target Id Frame
2 Thread 0x7f1f98a2f700 (LWP 19597) "ruby-timer-thr" 0x00007f1f976e9c5d in poll ()
at ../sysdeps/unix/syscall-template.S:81
* 1 Thread 0x7f1f98a24740 (LWP 19595) "ruby" 0x00007f1f976c81d7 in sched_yield ()
at ../sysdeps/unix/syscall-template.S:81
Updated by catphish (Charlie Smurthwaite) about 7 years ago
I am now testing the following patch:
diff --git a/thread_pthread.c b/thread_pthread.c
index 4aa2d620a2..fe99524a54 100644
--- a/thread_pthread.c
+++ b/thread_pthread.c
@@ -1685,6 +1685,7 @@ native_stop_timer_thread(void)
static void
native_reset_timer_thread(void)
{
+ timer_thread_pipe.writing = 0;
if (TT_DEBUG) fprintf(stderr, "reset timer thread\n");
}
I don't know if this is the correct way to solve the problem, but I will update this thread when I know whether it is effective. Even if this is the wrong way to solve the issue, it will serve to provide more information about it.
Updated by catphish (Charlie Smurthwaite) about 7 years ago
- File sched_yield_1.patch added
Updated by catphish (Charlie Smurthwaite) about 7 years ago
- File deleted (sched_yield_1.patch)
Updated by catphish (Charlie Smurthwaite) about 7 years ago
- File sched_yield_1.patch added
The patch above does not work because native_reset_timer_thread runs after fork in the parent. Attached an alternative patch that runs in gvl_atfork and appears not to run in the parent. Again, I do not believe this is necessarily the correct way to fix this, but I am also looking for something that works for me in the meantime.
Comments appreciated. Apologies for the mess of added and removed files.
Updated by catphish (Charlie Smurthwaite) about 7 years ago
- File deleted (sched_yield_1.patch)
Updated by catphish (Charlie Smurthwaite) about 7 years ago
- File sched_yield_1.patch added
Updated by normalperson (Eric Wong) about 7 years ago
charlie@atech.media wrote:
File sched_yield_1.patch added
The patch above does not work because
native_reset_timer_thread runs after fork in the parent.
Attached an alternative patch that runs in gvl_atfork and
appears not to run in the parent. Again I do not believe this
is necessarily the correct way to fix this but I am also
looking for something that works for me in the meantime.
Agreed that it may not necessarily be an optimal way to fix the
problem, but have you been able to reproduce the problem with
your patch applied?
Comments appreciated.
Sorry, I've been too distracted this summer. Will attempt to
concentrate on it now.
Updated by normalperson (Eric Wong) about 7 years ago
charlie@atech.media wrote:
Eric Wong normalperson@yhbt.net wrote:
How about checking owner_process before incrementing?
I'm afraid this fix doesn't quite match up in my mind. To
clarify, I am suggesting that timer_thread_pipe.writing is
being incremented in the parent process before the fork
occurs. This would still occur because the PID would match at
that point.
I am now more sure that checking .owner_process before
incrementing is sufficient. We zero .owner_process before
looping on the .writing check in native_stop_timer_thread, so if
the sighandler fires after the .writing check loop in the parent,
it could be carried over to the child. Merely checking
.owner_process as in my original patch is enough to stop a
non-zero .writing value from being carried over to a child.
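Paraphrasing the relevant shape of native_stop_timer_thread (not the exact source), the ordering being described is:

/* Paraphrase, not exact source: owner_process is cleared first, then the
 * parent waits for any in-flight sighandler write to finish.  Without an
 * owner_process check in the wakeup path, a signal arriving after this wait
 * loop can still raise .writing right before fork(), and the child then
 * inherits the non-zero value. */
timer_thread_pipe.owner_process = 0;    /* block future wakeups */
while (ATOMIC_CAS(timer_thread_pipe.writing, (rb_atomic_t)0, 0)) {
    native_thread_yield();              /* wait out an in-flight writer */
}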
A stronger version which zeroes .writing when re-initializing the
timer thread pipe would be:
https://80x24.org/spew/20170828232657.GA22848@dcvr/raw
However, I think zero-ing .writing is unnecessary, so I stand
by my original patch:
https://80x24.org/spew/20170809232533.14932-1-e@80x24.org/raw
Perhaps the following to ensure (.writing == 0) in
rb_thread_create_timer_thread will be useful, at least:
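Purely as a hypothetical illustration (not the snippet referenced above), such a startup check might look like this, assuming rb_bug is used to flag the violated invariant:

/* Hypothetical sketch only. */
static void
rb_thread_create_timer_thread(void)
{
    if (timer_thread_pipe.writing != 0)
        rb_bug("timer_thread_pipe.writing must be 0 before creating the timer thread");
    /* ... existing setup of the communication pipe and the timer thread ... */
}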
Updated by catphish (Charlie Smurthwaite) about 7 years ago
Apologies for my delay in replying. I have not yet had an opportunity to fully test any of these patches (apart from my initial hack which did not work). I will aim to test your patch as soon as possible. Thank you for your assistance!
Updated by catphish (Charlie Smurthwaite) about 7 years ago
Hi Eric,
I have been testing your original patch (just the PID check) for a couple of days and it appears to have resolved the problem. I will report on this again in 1 week as the issue occurs quite randomly but I am currently hopeful. Thank you very much again for looking into this.
Charlie
Updated by gregoryp (Gregory Potamianos) about 7 years ago
- File sched_yield_loop.rb sched_yield_loop.rb added
Hi,
We are also affected by this bug, running Debian's ruby 2.3.3p222. We are encountering it with Exec resources of puppet agents timing out and leaving stray processes in a busy loop. We have tried to reproduce it, but without much success. The attached simple fork/exec script seems to reproduce it sporadically if run like:
while true; do nice -n19 ruby sched_yield_loop.rb; [ $? -ne 0 ] && break; done
It eventually raises a timeout error and leaves a child behind spinning on sched_yield.
catphish (Charlie Smurthwaite) wrote:
Hi Eric,
I have been testing your original patch (just the PID check) for a couple of days and it appears to have resolved the problem. I will report on this again in 1 week as the issue occurs quite randomly but I am currently hopeful. Thank you very much again for looking into this.
Charlie
We have been running ruby with this patch for 2 weeks now and it seems to solve the problem for us also. Is there any chance of this being merged/backported to ruby 2.3?
Regards,
Gregory
Updated by kernigh (George Koehler) about 7 years ago
Gregory Potamianos wrote:
while true; do nice -n19 ruby sched_yield_loop.rb; [ $? -ne 0 ] && break; done
With your script, it is easy to reproduce the bug.
I shortened your shell loop to
while nice -n19 ruby sched_yield_loop.rb; do done
I tested both patches by Eric Wong:
the weaker version with the PID check
https://80x24.org/spew/20170809232533.14932-1-e@80x24.org/raw
and the stronger version with the PID check and zeroing of .writing
https://80x24.org/spew/20170828232657.GA22848@dcvr/raw
I used this Ruby,
$ ruby -v
ruby 2.5.0dev (2017-09-30 trunk 60064) [x86_64-openbsd6.1]
The shell loop running sched_yield_loop.rb can run for a few minutes before the bug happens. It happens when sched_yield_loop.rb raises a timeout error; then I find a child Ruby spinning the CPU, as Gregory Potamianos described. Gregory, running Debian, reported that the weaker patch seems to fix the bug. I, running OpenBSD, observe that neither patch fixes the bug. I can still get the timeout error and the spinning child whether Ruby is unpatched, has the weaker patch, or has the stronger patch.
But I might have found a different bug. I did kill -ABRT a spinning child and gave the core dump to gdb; it seemed that both threads were stuck inside OpenBSD's thread library. The main thread was stuck in pthread_join(), and the other thread was stuck in _rthread_tls_destructors(). I did not find any thread stuck in the loop while (ATOMIC_CAS(timer_thread_pipe.writing, (rb_atomic_t)0, 0))
identified by Charlie Smurthwaite in the original bug report.
Anyone can use Gregory's sched_yield_loop.rb to check for the bug. If the weaker patch from Eric Wong fixes the bug for Linux, I suggest putting the weaker patch in trunk and backporting it to older Ruby versions.
Updated by normalperson (Eric Wong) about 7 years ago
- Backport changed from 2.2: UNKNOWN, 2.3: UNKNOWN, 2.4: UNKNOWN to 2.2: UNKNOWN, 2.3: REQUIRED, 2.4: REQUIRED
Updated by Anonymous about 7 years ago
- Status changed from Open to Closed
Applied in changeset trunk|r60079.
thread_pthread.c: do not wakeup inside child processes
- thread_pthread.c (rb_thread_wakeup_timer_thread): check
ownership before incrementing
(rb_thread_wakeup_timer_thread_low): ditto
[Bug #13794] [ruby-core:83064]
Updated by normalperson (Eric Wong) about 7 years ago
xkernigh@netscape.net wrote:
Anyone can use Gregory's sched_yield_loop.rb to check for the
bug. If the weaker patch from Eric Wong fixes the bug for
Linux, I suggest to put the weaker patch in trunk, and to
backport it to older Ruby versions.
Thanks to all of you for the feedback; I've committed my original patch
as r60079.
But I might have found a different bug. I did kill -ABRT a
spinning child and gave the core dump to gdb; it seemed that
both threads were stuck inside OpenBSD's thread library. The
main thread was stuck in pthread_join(), and the other thread
was stuck in _rthread_tls_destructors(). I did not find any
thread stuck in the loop while (ATOMIC_CAS(timer_thread_pipe.writing, (rb_atomic_t)0, 0))
identified by Charlie Smurthwaite in the original bug report.
Yes, this seems like a different bug. Maybe OpenBSD pthreads
doesn't work well with fork/vfork (glibc barely does), and
that's a bug for OpenBSD guys to fix.
Updated by normalperson (Eric Wong) about 7 years ago
xkernigh@netscape.net wrote:
But I might have found a different bug. I did kill -ABRT a
spinning child and gave the core dump to gdb; it seemed that
both threads were stuck inside OpenBSD's thread library. The
main thread was stuck in pthread_join(), and the other thread
was stuck in _rthread_tls_destructors(). I did not find any
thread stuck in the loop while (ATOMIC_CAS(timer_thread_pipe.writing, (rb_atomic_t)0, 0))
identified by Charlie Smurthwaite in the original bug report.
Oh, it might be related to wanabe's patch for [Bug #13887]
for systems with FIBER_USE_NATIVE=0
https://bugs.ruby-lang.org/issues/13887
https://80x24.org/spew/20170928004228.4538-2-e@80x24.org/raw
Updated by catphish (Charlie Smurthwaite) about 7 years ago
I'd just like to confirm that after several weeks, I have not seen a recurrence of this issue while running the original PID check patch. Thanks all!
Updated by catphish (Charlie Smurthwaite) almost 7 years ago
I notice that this bug has been closed for a while but has not been backported into 2.3. Is this likely to happen? Thanks!
Updated by nagachika (Tomoyuki Chikanaga) almost 7 years ago
- Backport changed from 2.2: UNKNOWN, 2.3: REQUIRED, 2.4: REQUIRED to 2.2: UNKNOWN, 2.3: REQUIRED, 2.4: DONE
ruby_2_4 r61854 merged revision(s) 60079.
Updated by usa (Usaku NAKAMURA) almost 7 years ago
- Backport changed from 2.2: UNKNOWN, 2.3: REQUIRED, 2.4: DONE to 2.2: UNKNOWN, 2.3: DONE, 2.4: DONE
ruby_2_3 r62142 merged revision(s) 60079.