Project

General

Profile

Bug #13794

Infinite loop of sched_yield

Added by catphish (Charlie Smurthwaite) about 2 months ago. Updated 22 days ago.

Status:
Open
Priority:
Normal
Assignee:
-
Target version:
-
ruby -v:
ruby 2.3.4p301 (2017-03-30 revision 58214) [x86_64-linux]
[ruby-core:82311]

Description

I have been encountering an issue with processes hanging in an infinite loop of calling sched_yield(). The looping code can be found at https://github.com/ruby/ruby/blob/v2_3_4/thread_pthread.c#L1663

while (ATOMIC_CAS(timer_thread_pipe.writing, (rb_atomic_t)0, 0)) {
native_thread_yield();
}

It is my belief that by some mechanism I have not been able to identify, timer_thread_pipe.writing is incremented but it never decremented, causing this loop to run infinitely.

I am not able to create a reproducible test case, however this issue occurs regularly in my production application. I have attached backtraces and thread lists from 2 processes exhibiting this behaviour. gdb confirms that timer_thread_pipe.writing = 1 in these processes.

I believe one possibility of the cause is that rb_thread_wakeup_timer_thread() or rb_thread_wakeup_timer_thread_low() is called, and before it returns, another thread calls fork(), leaving the value of timer_thread_pipe.writing incremented, but leaving behind the thread that would normally decrement it.

If this is correct, one solution would be to reset timer_thread_pipe.writing to 0 in native_reset_timer_thread() immediately after a fork.

Other examples of similar bugs being reported:
https://github.com/resque/resque/issues/578
https://github.com/zk-ruby/zk/issues/50

backtrace_1.txt (14 KB) backtrace_1.txt catphish (Charlie Smurthwaite), 08/09/2017 03:17 PM
backtrace_2.txt (10.9 KB) backtrace_2.txt catphish (Charlie Smurthwaite), 08/09/2017 03:17 PM
sched_yield_1.patch (738 Bytes) sched_yield_1.patch catphish (Charlie Smurthwaite), 08/17/2017 03:09 PM

History

#1 [ruby-core:82314] Updated by normalperson (Eric Wong) about 2 months ago

Thanks for the report. I'll take a look at it in a few hours.

#2 [ruby-core:82319] Updated by normalperson (Eric Wong) about 1 month ago

charlie@atech.media wrote:

Bug #13794: Infinite loop of sched_yield

https://bugs.ruby-lang.org/issues/13794

I have been encountering an issue with processes hanging in an infinite loop of calling sched_yield(). The looping code can be found at https://github.com/ruby/ruby/blob/v2_3_4/thread_pthread.c#L1663

while (ATOMIC_CAS(timer_thread_pipe.writing, (rb_atomic_t)0, 0)) {
native_thread_yield();
}

It is my belief that by some mechanism I have not been able to identify, timer_thread_pipe.writing is incremented but it never decremented, causing this loop to run infinitely.

I am not able to create a reproducible test case, however this issue occurs regularly in my production application. I have attached backtraces and thread lists from 2 processes exhibiting this behaviour. gdb confirms that timer_thread_pipe.writing = 1 in these processes.

Can you also check the value of timer_thread_pipe.owner_process?

I believe one possibility of the cause is that rb_thread_wakeup_timer_thread() or rb_thread_wakeup_timer_thread_low() is called, and before it returns, another thread calls fork(), leaving the value of timer_thread_pipe.writing incremented, but leaving behind the thread that would normally decrement it.

That is a likely possibility.

If this is correct, one solution would be to reset timer_thread_pipe.writing to 0 in native_reset_timer_thread() immediately after a fork.

How about checking owner_process before incrementing?
Can you try the following patch to check owner_process?

https://80x24.org/spew/20170809232533.14932-1-e@80x24.org/raw

timer_thread_pipe.writing was introduced in August 2015 with r51576,
so this bug would definitely be my fault.

Other examples of similar bugs being reported:
https://github.com/resque/resque/issues/578
https://github.com/zk-ruby/zk/issues/50

That also means these bugs from 2012 are from other causes.

Thanks again for this report.

#3 [ruby-core:82320] Updated by normalperson (Eric Wong) about 1 month ago

Eric Wong wrote:

https://80x24.org/spew/20170809232533.14932-1-e@80x24.org/raw

Also, a disclaimer: I've been far from my usual self lately.
Everything about my proposed patch could be wrong and bogus; too.

Extra eyes are, as always, very welcome.

#4 [ruby-core:82321] Updated by catphish (Charlie Smurthwaite) about 1 month ago

Can you also check the value of timer_thread_pipe.owner_process?

I don't have any broken processes available right now, but I will check as soon as I can.

How about checking owner_process before incrementing?

I'm afraid this fix doesn't quite match up in my mind. To clarify, I am suggesting that timer_thread_pipe.writing is being incremented in the parent process before the fork occurs. This would still occur because the PID would match at that point.

#5 [ruby-core:82322] Updated by catphish (Charlie Smurthwaite) about 1 month ago

How about checking owner_process before incrementing?

I'm afraid this fix doesn't quite match up in my mind. To clarify, I am suggesting that timer_thread_pipe.writing is being incremented in the parent process before the fork occurs. This would still occur because the PID would match at that point.

Checking it before the while loop might work though.

#6 [ruby-core:82386] Updated by catphish (Charlie Smurthwaite) about 1 month ago

Can you also check the value of timer_thread_pipe.owner_process?

(gdb) print timer_thread_pipe.writing
$1 = 1
(gdb) print timer_thread_pipe.owner_process
$2 = 0

(gdb) info threads
  Id   Target Id         Frame 
  2    Thread 0x7f1f98a2f700 (LWP 19597) "ruby-timer-thr" 0x00007f1f976e9c5d in poll ()
    at ../sysdeps/unix/syscall-template.S:81
* 1    Thread 0x7f1f98a24740 (LWP 19595) "ruby" 0x00007f1f976c81d7 in sched_yield ()
    at ../sysdeps/unix/syscall-template.S:81

#7 [ruby-core:82414] Updated by catphish (Charlie Smurthwaite) about 1 month ago

I am now testing the following patch:

diff --git a/thread_pthread.c b/thread_pthread.c
index 4aa2d620a2..fe99524a54 100644
--- a/thread_pthread.c
+++ b/thread_pthread.c
@@ -1685,6 +1685,7 @@ native_stop_timer_thread(void)
 static void
 native_reset_timer_thread(void)
 {
+    timer_thread_pipe.writing = 0;
     if (TT_DEBUG)  fprintf(stderr, "reset timer thread\n");
 }

I don't know if this is the correct way to solve the problem, but I will update this thread when I know if it is effective or not. Even if this is the wrong way to solve the issue it will serve to provide more information about it.

#8 [ruby-core:82417] Updated by catphish (Charlie Smurthwaite) about 1 month ago

  • File sched_yield_1.patch added

#9 Updated by catphish (Charlie Smurthwaite) about 1 month ago

  • File deleted (sched_yield_1.patch)

#10 [ruby-core:82418] Updated by catphish (Charlie Smurthwaite) about 1 month ago

  • File sched_yield_1.patch added

The patch above does not work because native_reset_timer_thread runs after fork in the parent. Attached an alternative patch that runs in gvl_atfork and appears not to run in the parent. Again I do not believe this is necessarily the correct way to fix this but I am also looking for something that works for me i the meantime.

Comments appreciated. Apologies for the mess of added and removed files.

#11 Updated by catphish (Charlie Smurthwaite) about 1 month ago

  • File deleted (sched_yield_1.patch)

#13 [ruby-core:82452] Updated by normalperson (Eric Wong) about 1 month ago

charlie@atech.media wrote:

File sched_yield_1.patch added

The patch above does not work because
native_reset_timer_thread runs after fork in the parent.
Attached an alternative patch that runs in gvl_atfork and
appears not to run in the parent. Again I do not believe this
is necessarily the correct way to fix this but I am also
looking for something that works for me i the meantime.

Agreed that it may not necessarily be an optimal way to fix the
problem; but have you been able to reproduce the problem with
your patch applied?

Comments appreciated.

Sorry, I've been too distracted this summer. Will attempt to
concentrate on it, now.

#14 [ruby-core:82495] Updated by normalperson (Eric Wong) 26 days ago

charlie@atech.media wrote:

Eric Wong normalperson@yhbt.net wrote:

How about checking owner_process before incrementing?

I'm afraid this fix doesn't quite match up in my mind. To
clarify, I am suggesting that timer_thread_pipe.writing is
being incremented in the parent process before the fork
occurs. This would still occur because the PID would match at
that point.

I am now more sure that checking .owner_process before
incrementing is sufficient. We zero .owner_process before
looping on the .writing check in native_stop_timer_thread, so if
sighandler fires after the .writing check loop in the parent; it
could be carried over to the child. Merely checking
.owner_process as in my original patch is enough to stop a
non-zero .writing value from being carried over to a child.

A stronger version which zeroes .writing when re-initializing the
timer thread pipe would be:

https://80x24.org/spew/20170828232657.GA22848@dcvr/raw

However, I think zero-ing .writing is unnecessary, so I stand
by my original patch:

https://80x24.org/spew/20170809232533.14932-1-e@80x24.org/raw

Perhaps the following to ensure (.writing == 0) in
rb_thread_create_timer_thread will be useful, at least:

https://80x24.org/spew/20170828234743.GA27737@dcvr/raw

#15 [ruby-core:82510] Updated by catphish (Charlie Smurthwaite) 25 days ago

Apologies for my delay in replying. I have not yet had an opportunity to fully test any of these patches (apart from my initial hack which did not work). I will aim to test your patch as soon as possible. Thank you for your assistance!

#16 [ruby-core:82626] Updated by catphish (Charlie Smurthwaite) 22 days ago

Hi Eric,

I have been testing your original patch (just the PID check) for a couple of days and it appears to have resolved the problem. I will report on this again in 1 week as the issue occurs quite randomly but I am currently hopeful. Thank you very much again for looking into this.

Charlie

Also available in: Atom PDF