Ruby Issue Tracking System https://bugs.ruby-lang.org/ 2023-05-18T23:42:49Z
Ruby master - Bug #19680: test_process.rb tests fail sometimes on FreeBSD https://bugs.ruby-lang.org/issues/19680?journal_id=103146 2023-05-18T23:42:49Z kjtsanaktsidis (KJ Tsanaktsidis) kjtsanaktsidis@gmail.com
<p>Apologies, I accidentally submitted the issue while I was still writing it; I'm editing it now to add some detail. (OK, done now.)</p>
Ruby master - Bug #19680: test_process.rb tests fail sometimes on FreeBSD https://bugs.ruby-lang.org/issues/19680?journal_id=103148 2023-05-19T00:26:45Z kjtsanaktsidis (KJ Tsanaktsidis) kjtsanaktsidis@gmail.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/103148/diff?detail_id=64907">diff</a>)</li></ul>
Ruby master - Bug #19680: test_process.rb tests fail sometimes on FreeBSD https://bugs.ruby-lang.org/issues/19680?journal_id=103214 2023-05-22T00:23:50Z kjtsanaktsidis (KJ Tsanaktsidis) kjtsanaktsidis@gmail.com
<p>I made a fair bit more progress on this over the weekend:</p>
<ul>
<li>For the <code>TestProcess#test_daemon_no_threads</code> test - I updated my patch slightly: <a href="https://github.com/ruby/ruby/commit/2f306cbd15de9899906a563012c92fd02b805300" class="external">https://github.com/ruby/ruby/commit/2f306cbd15de9899906a563012c92fd02b805300</a>
</li>
<li>For the bug in FreeBSD - It really is a bug, and the FreeBSD developers have patched it - <a href="https://reviews.freebsd.org/D40178" class="external">https://reviews.freebsd.org/D40178</a>. I also discovered that it can be worked around by setting the <code>LD_BIND_NOW</code> environment variable, so it might be worth having the test runner set it on existing FreeBSD versions.</li>
</ul>
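<p>As a rough sketch of how a test runner might apply that workaround only where it is needed: the helper name <code>bind_now_env</code> is hypothetical, the assumption that <code>RUBY_PLATFORM</code> embeds the FreeBSD release (for example <code>x86_64-freebsd13.1</code>) should be checked on a real host, and the 13.3 cutoff is the one mentioned in this thread. This is not the code from any actual patch.</p>

```ruby
# Hypothetical helper: returns extra environment variables for child test
# processes. Assumes the platform string looks like "x86_64-freebsd13.1".
def bind_now_env(platform = RUBY_PLATFORM)
  m = /freebsd(\d+)(?:\.(\d+))?/.match(platform)
  return {} unless m # not FreeBSD: nothing to work around

  version = [m[1].to_i, m[2].to_i] # a missing minor version counts as 0
  # Array#<=> compares element by element, so [13, 1] sorts before [13, 3].
  (version <=> [13, 3]) < 0 ? { "LD_BIND_NOW" => "1" } : {}
end

# Usage: pass the hash as the environment argument of Process.spawn, e.g.
#   Process.spawn(bind_now_env, "ruby", "some_test.rb")  # file name is made up
```

<p>Returning an empty hash on other platforms keeps the call site unconditional, since <code>Process.spawn</code> accepts an environment hash as its first argument.</p>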
<p>I also found a similar issue in the <code>TestIO#test_race_gets_and_close</code> test. This test can hang or take a very long time to finish, because:</p>
<ul>
<li>The main thread goes to close one of the pipes here: <code>https://github.com/ruby/ruby/blob/872249e209fdb7b7c890a93b0f93a74a62d21aec/test/ruby/test_io.rb#L3838</code>
</li>
<li>The other threads are currently using those files here: <code>https://github.com/ruby/ruby/blob/872249e209fdb7b7c890a93b0f93a74a62d21aec/test/ruby/test_io.rb#L3829</code>
</li>
<li>So, when closing the pipe, the main thread calls <code>rb_notify_fd_close</code> here: <a href="https://github.com/ruby/ruby/blob/872249e209fdb7b7c890a93b0f93a74a62d21aec/io.c#L5643" class="external">https://github.com/ruby/ruby/blob/872249e209fdb7b7c890a93b0f93a74a62d21aec/io.c#L5643</a>. This builds up a list of the other threads currently using the pipe we're trying to close.</li>
<li>It later waits for that list to become empty by calling <code>rb_thread_schedule()</code> in a loop, here: <a href="https://github.com/ruby/ruby/blob/872249e209fdb7b7c890a93b0f93a74a62d21aec/io.c#L5470" class="external">https://github.com/ruby/ruby/blob/872249e209fdb7b7c890a93b0f93a74a62d21aec/io.c#L5470</a>.</li>
<li>
<code>rb_thread_schedule</code> eventually winds up whacking the other threads with <code>SIGVTALRM</code> while they are trying to leave the blocking region of <code>gets</code>.</li>
<li>It seems that on FreeBSD those threads are very often not ready in time, so <code>rb_thread_schedule</code> ends up with nothing to run but the main thread again.</li>
<li>That kicks off the whole loop again.</li>
</ul>
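<p>The access pattern described above, with reader threads sitting in <code>gets</code> while the main thread closes the pipe underneath them, can be sketched roughly as follows. This is an illustration written for this discussion, not the source of the test; the thread count and the <code>:eof</code>/<code>:closed</code> return values are arbitrary choices.</p>

```ruby
pipes = []
readers = []

4.times do
  r, w = IO.pipe
  pipes << [r, w]
  readers << Thread.new do
    begin
      nil while r.gets # keep reading until end of file
      :eof
    rescue IOError
      :closed          # the pipe was closed by the main thread
    end
  end
end

# The main thread closes both ends while the readers may still be in gets.
# Closing the read end has to wait for any thread that is still using it.
pipes.each do |r, w|
  w.write("a")
  w.close
  r.close
end

# Bounded join so that a hang shows up as nil instead of blocking forever.
results = readers.map { |t| t.join(5)&.value }
```

<p>Each reader either sees end of file or gets an <code>IOError</code> because the pipe was closed; in both cases it should finish almost immediately, which is why a long wait in <code>r.close</code> points at the waiting loop rather than at the readers.</p>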
<p>I thought it was suspicious that this test has a 200 second timeout on it; there is no need for it to take that long.</p>
<p>This patch stops the busy-waiting through <code>rb_thread_schedule</code> and instead uses a dedicated condition variable, which is signalled whenever one of the <code>gets</code> threads is done with the file descriptor that is being closed: <a href="https://github.com/ruby/ruby/commit/a17304b3c4eeffbe945c8d0d4555c096c1183045" class="external">https://github.com/ruby/ruby/commit/a17304b3c4eeffbe945c8d0d4555c096c1183045</a></p>
<p>After applying this patch, <code>TestIO#test_race_gets_and_close</code> passes in a matter of milliseconds, not hundreds of seconds.</p>
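<p>The idea of waiting on a condition variable instead of rescheduling in a loop can be illustrated at the Ruby level. The actual change is C code inside the interpreter; the <code>FdUsers</code> class below and all of its method names are hypothetical and exist only to show the shape of the technique.</p>

```ruby
# Hypothetical illustration: FdUsers is not a real Ruby class. It counts the
# threads using a resource and lets a closer wait for them to finish.
class FdUsers
  attr_reader :count

  def initialize
    @mutex = Mutex.new
    @cond = ConditionVariable.new
    @count = 0
  end

  # Run the block while registered as a user of the resource.
  def use
    @mutex.synchronize { @count += 1 }
    begin
      yield
    ensure
      @mutex.synchronize do
        @count -= 1
        @cond.broadcast if @count.zero? # wake the waiting closer
      end
    end
  end

  # Sleep until no users remain, instead of rescheduling in a loop.
  def wait_until_unused
    @mutex.synchronize do
      @cond.wait(@mutex) until @count.zero?
    end
  end
end

users = FdUsers.new
started = Queue.new
readers = 3.times.map do
  Thread.new do
    users.use do
      started << true # tell the main thread we are registered
      sleep 0.05      # stand-in for a blocking read
    end
  end
end

3.times { started.pop } # all readers are now registered
users.wait_until_unused # sleeps until the last reader signals
readers.each(&:join)
```

<p>The closer is only woken when the last user leaves, so how long it waits no longer depends on how the scheduler happens to interleave the threads.</p>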
<a name="Next-steps"></a>
<h2 >Next steps<a href="#Next-steps" class="wiki-anchor">¶</a></h2>
<ul>
<li>I ran the test suite in a loop for 12 hours, two runs in parallel, on my FreeBSD VM: no failures of any kind!</li>
<li>Now I need to do the same thing on at least Linux, macOS, and Windows, since these are generic changes to <code>thread.c</code> that will affect all platforms.</li>
<li>There are also a few other tests skipped on FreeBSD; I will review these and see if my patches fix them.</li>
<li>If that comes up OK, I'll tidy up my two patches and submit PRs for them.</li>
<li>I'll also submit a PR to make the test runner use <code>LD_BIND_NOW</code> on FreeBSD versions before 13.3.</li>
</ul>
Ruby master - Bug #19680: test_process.rb tests fail sometimes on FreeBSD https://bugs.ruby-lang.org/issues/19680?journal_id=103314 2023-05-26T06:50:38Z kjtsanaktsidis (KJ Tsanaktsidis) kjtsanaktsidis@gmail.com
<p>OK, so <a href="https://github.com/ruby/ruby/pull/7864" class="external">https://github.com/ruby/ruby/pull/7864</a> and <a href="https://github.com/ruby/ruby/pull/7865" class="external">https://github.com/ruby/ruby/pull/7865</a> were merged, so this <em>should</em> be fixed. I'll keep an eye on the CI tests over the weekend to see if this clears things up.</p>
<p>I also have <a href="https://github.com/ruby/ruby/pull/7867" class="external">https://github.com/ruby/ruby/pull/7867</a> open, which works around the FreeBSD bug I found, but that's probably less critical.</p>
Ruby master - Bug #19680: test_process.rb tests fail sometimes on FreeBSD https://bugs.ruby-lang.org/issues/19680?journal_id=103345 2023-05-30T01:42:42Z kjtsanaktsidis (KJ Tsanaktsidis) kjtsanaktsidis@gmail.com
<p>FreeBSD 13.1 CI hasn't failed since these fixes were merged, so (<em>touch wood</em>) I think we can call this done.</p>
Ruby master - Bug #19680: test_process.rb tests fail sometimes on FreeBSD https://bugs.ruby-lang.org/issues/19680?journal_id=103346 2023-05-30T01:44:26Z ioquatix (Samuel Williams) samuel@oriontransfer.net
<ul><li><strong>Status</strong> changed from <i>Open</i> to <i>Closed</i></li></ul><p>Closed per request.</p>