Project

General

Profile

Bug #13794

Infinite loop of sched_yield

Added by catphish (Charlie Smurthwaite) 4 months ago. Updated about 1 month ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Target version:
-
ruby -v:
ruby 2.3.4p301 (2017-03-30 revision 58214) [x86_64-linux]
[ruby-core:82311]

Description

I have been encountering an issue with processes hanging in an infinite loop of calling sched_yield(). The looping code can be found at https://github.com/ruby/ruby/blob/v2_3_4/thread_pthread.c#L1663

while (ATOMIC_CAS(timer_thread_pipe.writing, (rb_atomic_t)0, 0)) {
native_thread_yield();
}

It is my belief that by some mechanism I have not been able to identify, timer_thread_pipe.writing is incremented but it never decremented, causing this loop to run infinitely.

I am not able to create a reproducible test case, however this issue occurs regularly in my production application. I have attached backtraces and thread lists from 2 processes exhibiting this behaviour. gdb confirms that timer_thread_pipe.writing = 1 in these processes.

I believe one possibility of the cause is that rb_thread_wakeup_timer_thread() or rb_thread_wakeup_timer_thread_low() is called, and before it returns, another thread calls fork(), leaving the value of timer_thread_pipe.writing incremented, but leaving behind the thread that would normally decrement it.

If this is correct, one solution would be to reset timer_thread_pipe.writing to 0 in native_reset_timer_thread() immediately after a fork.

Other examples of similar bugs being reported:
https://github.com/resque/resque/issues/578
https://github.com/zk-ruby/zk/issues/50

backtrace_1.txt (14 KB) backtrace_1.txt catphish (Charlie Smurthwaite), 08/09/2017 03:17 PM
backtrace_2.txt (10.9 KB) backtrace_2.txt catphish (Charlie Smurthwaite), 08/09/2017 03:17 PM
sched_yield_1.patch (738 Bytes) sched_yield_1.patch catphish (Charlie Smurthwaite), 08/17/2017 03:09 PM
sched_yield_loop.rb (212 Bytes) sched_yield_loop.rb gregoryp (Gregory Potamianos), 09/29/2017 10:16 AM

Associated revisions

Revision 60079
Added by normal about 2 months ago

thread_pthread.c: do not wakeup inside child processes

  • thread_pthread.c (rb_thread_wakeup_timer_thread): check ownership before incrementing (rb_thread_wakeup_timer_thread_low): ditto [Bug #13794]

History

#1 [ruby-core:82314] Updated by normalperson (Eric Wong) 4 months ago

Thanks for the report. I'll take a look at it in a few hours.

#2 [ruby-core:82319] Updated by normalperson (Eric Wong) 4 months ago

charlie@atech.media wrote:

Bug #13794: Infinite loop of sched_yield

https://bugs.ruby-lang.org/issues/13794

I have been encountering an issue with processes hanging in an infinite loop of calling sched_yield(). The looping code can be found at https://github.com/ruby/ruby/blob/v2_3_4/thread_pthread.c#L1663

while (ATOMIC_CAS(timer_thread_pipe.writing, (rb_atomic_t)0, 0)) {
native_thread_yield();
}

It is my belief that by some mechanism I have not been able to identify, timer_thread_pipe.writing is incremented but it never decremented, causing this loop to run infinitely.

I am not able to create a reproducible test case, however this issue occurs regularly in my production application. I have attached backtraces and thread lists from 2 processes exhibiting this behaviour. gdb confirms that timer_thread_pipe.writing = 1 in these processes.

Can you also check the value of timer_thread_pipe.owner_process?

I believe one possibility of the cause is that rb_thread_wakeup_timer_thread() or rb_thread_wakeup_timer_thread_low() is called, and before it returns, another thread calls fork(), leaving the value of timer_thread_pipe.writing incremented, but leaving behind the thread that would normally decrement it.

That is a likely possibility.

If this is correct, one solution would be to reset timer_thread_pipe.writing to 0 in native_reset_timer_thread() immediately after a fork.

How about checking owner_process before incrementing?
Can you try the following patch to check owner_process?

https://80x24.org/spew/20170809232533.14932-1-e@80x24.org/raw

timer_thread_pipe.writing was introduced in August 2015 with r51576,
so this bug would definitely be my fault.

Other examples of similar bugs being reported:
https://github.com/resque/resque/issues/578
https://github.com/zk-ruby/zk/issues/50

That also means these bugs from 2012 are from other causes.

Thanks again for this report.

#3 [ruby-core:82320] Updated by normalperson (Eric Wong) 4 months ago

Eric Wong wrote:

https://80x24.org/spew/20170809232533.14932-1-e@80x24.org/raw

Also, a disclaimer: I've been far from my usual self lately.
Everything about my proposed patch could be wrong and bogus; too.

Extra eyes are, as always, very welcome.

#4 [ruby-core:82321] Updated by catphish (Charlie Smurthwaite) 4 months ago

Can you also check the value of timer_thread_pipe.owner_process?

I don't have any broken processes available right now, but I will check as soon as I can.

How about checking owner_process before incrementing?

I'm afraid this fix doesn't quite match up in my mind. To clarify, I am suggesting that timer_thread_pipe.writing is being incremented in the parent process before the fork occurs. This would still occur because the PID would match at that point.

#5 [ruby-core:82322] Updated by catphish (Charlie Smurthwaite) 4 months ago

How about checking owner_process before incrementing?

I'm afraid this fix doesn't quite match up in my mind. To clarify, I am suggesting that timer_thread_pipe.writing is being incremented in the parent process before the fork occurs. This would still occur because the PID would match at that point.

Checking it before the while loop might work though.

#6 [ruby-core:82386] Updated by catphish (Charlie Smurthwaite) 3 months ago

Can you also check the value of timer_thread_pipe.owner_process?

(gdb) print timer_thread_pipe.writing
$1 = 1
(gdb) print timer_thread_pipe.owner_process
$2 = 0

(gdb) info threads
  Id   Target Id         Frame 
  2    Thread 0x7f1f98a2f700 (LWP 19597) "ruby-timer-thr" 0x00007f1f976e9c5d in poll ()
    at ../sysdeps/unix/syscall-template.S:81
* 1    Thread 0x7f1f98a24740 (LWP 19595) "ruby" 0x00007f1f976c81d7 in sched_yield ()
    at ../sysdeps/unix/syscall-template.S:81

#7 [ruby-core:82414] Updated by catphish (Charlie Smurthwaite) 3 months ago

I am now testing the following patch:

diff --git a/thread_pthread.c b/thread_pthread.c
index 4aa2d620a2..fe99524a54 100644
--- a/thread_pthread.c
+++ b/thread_pthread.c
@@ -1685,6 +1685,7 @@ native_stop_timer_thread(void)
 static void
 native_reset_timer_thread(void)
 {
+    timer_thread_pipe.writing = 0;
     if (TT_DEBUG)  fprintf(stderr, "reset timer thread\n");
 }

I don't know if this is the correct way to solve the problem, but I will update this thread when I know if it is effective or not. Even if this is the wrong way to solve the issue it will serve to provide more information about it.

#8 [ruby-core:82417] Updated by catphish (Charlie Smurthwaite) 3 months ago

  • File sched_yield_1.patch added

#9 Updated by catphish (Charlie Smurthwaite) 3 months ago

  • File deleted (sched_yield_1.patch)

#10 [ruby-core:82418] Updated by catphish (Charlie Smurthwaite) 3 months ago

  • File sched_yield_1.patch added

The patch above does not work because native_reset_timer_thread runs after fork in the parent. Attached an alternative patch that runs in gvl_atfork and appears not to run in the parent. Again I do not believe this is necessarily the correct way to fix this but I am also looking for something that works for me i the meantime.

Comments appreciated. Apologies for the mess of added and removed files.

#11 Updated by catphish (Charlie Smurthwaite) 3 months ago

  • File deleted (sched_yield_1.patch)

#13 [ruby-core:82452] Updated by normalperson (Eric Wong) 3 months ago

charlie@atech.media wrote:

File sched_yield_1.patch added

The patch above does not work because
native_reset_timer_thread runs after fork in the parent.
Attached an alternative patch that runs in gvl_atfork and
appears not to run in the parent. Again I do not believe this
is necessarily the correct way to fix this but I am also
looking for something that works for me i the meantime.

Agreed that it may not necessarily be an optimal way to fix the
problem; but have you been able to reproduce the problem with
your patch applied?

Comments appreciated.

Sorry, I've been too distracted this summer. Will attempt to
concentrate on it, now.

#14 [ruby-core:82495] Updated by normalperson (Eric Wong) 3 months ago

charlie@atech.media wrote:

Eric Wong normalperson@yhbt.net wrote:

How about checking owner_process before incrementing?

I'm afraid this fix doesn't quite match up in my mind. To
clarify, I am suggesting that timer_thread_pipe.writing is
being incremented in the parent process before the fork
occurs. This would still occur because the PID would match at
that point.

I am now more sure that checking .owner_process before
incrementing is sufficient. We zero .owner_process before
looping on the .writing check in native_stop_timer_thread, so if
sighandler fires after the .writing check loop in the parent; it
could be carried over to the child. Merely checking
.owner_process as in my original patch is enough to stop a
non-zero .writing value from being carried over to a child.

A stronger version which zeroes .writing when re-initializing the
timer thread pipe would be:

https://80x24.org/spew/20170828232657.GA22848@dcvr/raw

However, I think zero-ing .writing is unnecessary, so I stand
by my original patch:

https://80x24.org/spew/20170809232533.14932-1-e@80x24.org/raw

Perhaps the following to ensure (.writing == 0) in
rb_thread_create_timer_thread will be useful, at least:

https://80x24.org/spew/20170828234743.GA27737@dcvr/raw

#15 [ruby-core:82510] Updated by catphish (Charlie Smurthwaite) 3 months ago

Apologies for my delay in replying. I have not yet had an opportunity to fully test any of these patches (apart from my initial hack which did not work). I will aim to test your patch as soon as possible. Thank you for your assistance!

#16 [ruby-core:82626] Updated by catphish (Charlie Smurthwaite) 3 months ago

Hi Eric,

I have been testing your original patch (just the PID check) for a couple of days and it appears to have resolved the problem. I will report on this again in 1 week as the issue occurs quite randomly but I am currently hopeful. Thank you very much again for looking into this.

Charlie

#17 [ruby-core:83061] Updated by gregoryp (Gregory Potamianos) about 2 months ago

Hi,

We are also affected by this bug, running Debian's ruby 2.3.3p222. We are encountering it with Exec resources of puppet agents timing out and leaving stray processes on busyloop. We have tried to reproduce it but not with much success. The attached simple fork/exec script seems to reproduce it sporadically if run like

while true; do nice -n19 ruby sched_yield_loop.rb; [ $? -ne 0 ] && break; done

. It eventually raises a timeout error and leaves a child behind spinning on sched_yield.

catphish (Charlie Smurthwaite) wrote:

Hi Eric,

I have been testing your original patch (just the PID check) for a couple of days and it appears to have resolved the problem. I will report on this again in 1 week as the issue occurs quite randomly but I am currently hopeful. Thank you very much again for looking into this.

Charlie

We have been running ruby with this patch for 2 weeks now and it seems to solve the problem for us also. Is there any chance of this being merged/backported to ruby 2.3?

Regards,
Gregory

#18 [ruby-core:83064] Updated by kernigh (George Koehler) about 2 months ago

Gregory Potamianos wrote:

while true; do nice -n19 ruby sched_yield_loop.rb; [ $? -ne 0 ] && break; done

With your script, it is easy to reproduce the bug.
I shortened your shell loop to
while nice -n19 ruby sched_yield_loop.rb; do done

I tested both patches by Eric Wong,
the weaker version with PID check

https://80x24.org/spew/20170809232533.14932-1-e@80x24.org/raw

and the stronger version with PID check and zeroing .writing

https://80x24.org/spew/20170828232657.GA22848@dcvr/raw

I used this Ruby,

$ ruby -v
ruby 2.5.0dev (2017-09-30 trunk 60064) [x86_64-openbsd6.1]

The shell loop running sched_yield_loop.rb can run for a few minutes before the bug happens. It happens when sched_yield_loop.rb raises a timeout error; then I find a child Ruby spinning the CPU, as Gregory Potamianos described. Gregory, running Debian, reported that the weaker patch seems to fix the bug. I, running OpenBSD, observe that neither patch fixes the bug. I can still get the timeout error and the spinning child when Ruby is without patch, with the weaker patch, or with the stronger patch.

But I might have found a different bug. I did kill -ABRT a spinning child and gave the core dump to gdb; it seemed that both threads were stuck inside OpenBSD's thread library. The main thread was stuck in pthread_join(), and the other thread was stuck in _rthread_tls_destructors(). I did not find any thread stuck in the loop while (ATOMIC_CAS(timer_thread_pipe.writing, (rb_atomic_t)0, 0)) identified by Charlie Smurthwaite in the original bug report.

Anyone can use Gregory's sched_yield_loop.rb to check for the bug. If the weaker patch from Eric Wong fixes the bug for Linux, I suggest to put the weaker patch in trunk, and to backport it to older Ruby versions.

#19 Updated by normalperson (Eric Wong) about 2 months ago

  • Backport changed from 2.2: UNKNOWN, 2.3: UNKNOWN, 2.4: UNKNOWN to 2.2: UNKNOWN, 2.3: REQUIRED, 2.4: REQUIRED

#20 Updated by Anonymous about 2 months ago

  • Status changed from Open to Closed

Applied in changeset trunk|r60079.


thread_pthread.c: do not wakeup inside child processes

  • thread_pthread.c (rb_thread_wakeup_timer_thread): check ownership before incrementing (rb_thread_wakeup_timer_thread_low): ditto [Bug #13794]

#21 [ruby-core:83067] Updated by normalperson (Eric Wong) about 2 months ago

xkernigh@netscape.net wrote:

Anyone can use Gregory's sched_yield_loop.rb to check for the
bug. If the weaker patch from Eric Wong fixes the bug for
Linux, I suggest to put the weaker patch in trunk, and to
backport it to older Ruby versions.

Thanks to all you for the feedback, I've commited my original patcha
as r60079.

But I might have found a different bug. I did kill -ABRT a
spinning child and gave the core dump to gdb; it seemed that
both threads were stuck inside OpenBSD's thread library. The
main thread was stuck in pthread_join(), and the other thread
was stuck in _rthread_tls_destructors(). I did not find any
thread stuck in the loop while
(ATOMIC_CAS(timer_thread_pipe.writing, (rb_atomic_t)0, 0))

identified by Charlie Smurthwaite in the original bug report.

Yes, this seems like a different bug. Maybe OpenBSD pthreads
doesn't work well with fork/vfork (glibc barely does), and
that's a bug for OpenBSD guys to fix.

#22 [ruby-core:83069] Updated by normalperson (Eric Wong) about 2 months ago

xkernigh@netscape.net wrote:

But I might have found a different bug. I did kill -ABRT a
spinning child and gave the core dump to gdb; it seemed that
both threads were stuck inside OpenBSD's thread library. The
main thread was stuck in pthread_join(), and the other thread
was stuck in _rthread_tls_destructors(). I did not find any
thread stuck in the loop while
(ATOMIC_CAS(timer_thread_pipe.writing, (rb_atomic_t)0, 0))

identified by Charlie Smurthwaite in the original bug report.

Oh, it might be related to wanabe's patch for [Bug #13887]
for systems with FIBER_USE_NATIVE=0
https://bugs.ruby-lang.org/issues/13887

https://80x24.org/spew/20170928004228.4538-2-e@80x24.org/raw

#23 [ruby-core:83392] Updated by catphish (Charlie Smurthwaite) about 1 month ago

I'd just like to confirm that after several weeks, I have not seen a recurrence of this issue while running the original PID check patch. Thanks all!

Also available in: Atom PDF