https://bugs.ruby-lang.org/https://bugs.ruby-lang.org/favicon.ico?17113305112018-08-16T13:00:36ZRuby Issue Tracking SystemRuby master - Bug #14999: ConditionVariable doesn't reacquire the Mutex if Thread#kill-edhttps://bugs.ruby-lang.org/issues/14999?journal_id=735612018-08-16T13:00:36ZEregon (Benoit Daloze)
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/73561/diff?detail_id=49713">diff</a>)</li></ul> Ruby master - Bug #14999: ConditionVariable doesn't reacquire the Mutex if Thread#kill-edhttps://bugs.ruby-lang.org/issues/14999?journal_id=735622018-08-16T20:12:28Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p><a href="mailto:eregontp@gmail.com" class="email">eregontp@gmail.com</a> wrote:</p>
<blockquote>
<p>Bug <a class="issue tracker-1 status-5 priority-4 priority-default closed" title="Bug: ConditionVariable doesn't reacquire the Mutex if Thread#kill-ed (Closed)" href="https://bugs.ruby-lang.org/issues/14999">#14999</a>: ConditionVariable doesn't require the Mutex if Thread#kill-ed<br>
<a href="https://bugs.ruby-lang.org/issues/14999" class="external">https://bugs.ruby-lang.org/issues/14999</a></p>
</blockquote>
<p>r64398</p>
<p>Thank you; it was actually a regression introduced while fixing<br>
[Bug <a class="issue tracker-1 status-5 priority-4 priority-default closed" title="Bug: Very rarely IO#readpartial does not raise EOFError (Closed)" href="https://bugs.ruby-lang.org/issues/14841">#14841</a>]. I was going to investigate this today (same as<br>
<a href="https://blade.ruby-lang.org/ruby-core/88498">[ruby-core:88498]</a>), anyways; but you saved me a bunch of time.</p>
<p>Btw, I haven't come up with a better reproducer for [Bug <a class="issue tracker-1 status-5 priority-4 priority-default closed" title="Bug: Very rarely IO#readpartial does not raise EOFError (Closed)" href="https://bugs.ruby-lang.org/issues/14841">#14841</a>]<br>
in the test suite. It takes forever :/</p> Ruby master - Bug #14999: ConditionVariable doesn't reacquire the Mutex if Thread#kill-edhttps://bugs.ruby-lang.org/issues/14999?journal_id=735662018-08-16T23:57:08ZAnonymous
<ul><li><strong>Status</strong> changed from <i>Open</i> to <i>Closed</i></li></ul><p>Applied in changeset commit:ruby-git|2cf3bd5bb2a7c4724e528577d37a883fe80a1122.</p>
<hr>
<p>thread_sync.c (rb_mutex_lock): acquire lock before being killed</p>
<p>We (the thread acquiring the mutex) need to acquire the mutex<br>
before being killed to work with ConditionVariable#wait.</p>
<p>Thus we reinstate the acquire-immediately-after-sleeping logic<br>
from pre-r63711 while still retaining the<br>
acquire-after-checking-for-interrupts logic from r63711.</p>
<p>This regression was introduced in<br>
commit 501069b8a4013f2e3fdde35c50e9527ef0061963 (r63711)<br>
("thread_sync.c (rb_mutex_lock): fix deadlock") for<br>
[Bug <a class="issue tracker-1 status-5 priority-4 priority-default closed" title="Bug: Very rarely IO#readpartial does not raise EOFError (Closed)" href="https://bugs.ruby-lang.org/issues/14841">#14841</a>]</p>
<p><a href="/issues/14999">[ruby-core:88503]</a> [Bug <a class="issue tracker-1 status-5 priority-4 priority-default closed" title="Bug: ConditionVariable doesn't reacquire the Mutex if Thread#kill-ed (Closed)" href="https://bugs.ruby-lang.org/issues/14999">#14999</a>] [Bug <a class="issue tracker-1 status-5 priority-4 priority-default closed" title="Bug: Very rarely IO#readpartial does not raise EOFError (Closed)" href="https://bugs.ruby-lang.org/issues/14841">#14841</a>]</p>
<p>git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@64398 b2dd03c8-39d4-4d8f-98ff-823fe69b080e</p> Ruby master - Bug #14999: ConditionVariable doesn't reacquire the Mutex if Thread#kill-edhttps://bugs.ruby-lang.org/issues/14999?journal_id=735712018-08-17T09:40:10ZEregon (Benoit Daloze)
<ul><li><strong>Subject</strong> changed from <i>ConditionVariable doesn't require the Mutex if Thread#kill-ed</i> to <i>ConditionVariable doesn't reacquire the Mutex if Thread#kill-ed</i></li></ul> Ruby master - Bug #14999: ConditionVariable doesn't reacquire the Mutex if Thread#kill-edhttps://bugs.ruby-lang.org/issues/14999?journal_id=735802018-08-17T23:54:24ZEregon (Benoit Daloze)
<ul></ul><p><a class="user active user-mention" href="https://bugs.ruby-lang.org/users/724">@normalperson (Eric Wong)</a> I added the specs in r64409.<br>
However I just saw that the spec failed once on Ubuntu:<br>
<a href="https://rubyci.org/logs/rubyci.s3.amazonaws.com/ubuntu/ruby-trunk/log/20180817T213003Z.fail.html.gz" class="external">https://rubyci.org/logs/rubyci.s3.amazonaws.com/ubuntu/ruby-trunk/log/20180817T213003Z.fail.html.gz</a><br>
Any idea?</p> Ruby master - Bug #14999: ConditionVariable doesn't reacquire the Mutex if Thread#kill-edhttps://bugs.ruby-lang.org/issues/14999?journal_id=735812018-08-18T00:32:46Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p><a href="mailto:eregontp@gmail.com" class="email">eregontp@gmail.com</a> wrote:</p>
<blockquote>
<p><a class="user active user-mention" href="https://bugs.ruby-lang.org/users/724">@normalperson (Eric Wong)</a> I added the specs in r64409.<br>
However I just saw that the spec failed once on Ubuntu:<br>
<a href="https://rubyci.org/logs/rubyci.s3.amazonaws.com/ubuntu/ruby-trunk/log/20180817T213003Z.fail.html.gz" class="external">https://rubyci.org/logs/rubyci.s3.amazonaws.com/ubuntu/ruby-trunk/log/20180817T213003Z.fail.html.gz</a></p>
</blockquote>
<p>It might be because of rb_funcallv switching and hitting<br>
interrupts. I think this should fix it, but it can make<br>
an incompatibility if somebody relies on redefining<br>
Mutex#sleep...</p>
<pre><code>diff --git a/thread_sync.c b/thread_sync.c
index 5e511af0db..87c17316a9 100644
--- a/thread_sync.c
+++ b/thread_sync.c
@@ -1358,7 +1358,7 @@ static VALUE
do_sleep(VALUE args)
{
struct sleep_call *p = (struct sleep_call *)args;
- return rb_funcallv(p->mutex, id_sleep, 1, &p->timeout);
+ return rb_mutex_sleep(p->mutex, p->timeout);
}
static VALUE
</code></pre> Ruby master - Bug #14999: ConditionVariable doesn't reacquire the Mutex if Thread#kill-edhttps://bugs.ruby-lang.org/issues/14999?journal_id=735902018-08-18T04:42:30Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p>Eric Wong wrote:</p>
<blockquote>
<p><a href="mailto:eregontp@gmail.com" class="email">eregontp@gmail.com</a> wrote:</p>
<blockquote>
<p><a class="user active user-mention" href="https://bugs.ruby-lang.org/users/724">@normalperson (Eric Wong)</a> I added the specs in r64409.<br>
However I just saw that the spec failed once on Ubuntu:<br>
<a href="https://rubyci.org/logs/rubyci.s3.amazonaws.com/ubuntu/ruby-trunk/log/20180817T213003Z.fail.html.gz" class="external">https://rubyci.org/logs/rubyci.s3.amazonaws.com/ubuntu/ruby-trunk/log/20180817T213003Z.fail.html.gz</a></p>
</blockquote>
<p>It might be because of rb_funcallv switching and hitting<br>
interrupts. I think this should fix it, but it can make<br>
an incompatibility if somebody relies on redefining<br>
Mutex#sleep...</p>
</blockquote>
<p>Committed a version which only avoids switching on r64436<br>
We'll see if CI alerts are quieter, now.</p>
<p>However, I don't know what to do about Mutex_m since that still<br>
hits the funcall path. We can't blindly mask out interrupts<br>
because we need Mutex#sleep to react to Thread#run.</p>
<p>Also, please don't add similar specs for Mutex_m, yet.</p> Ruby master - Bug #14999: ConditionVariable doesn't reacquire the Mutex if Thread#kill-edhttps://bugs.ruby-lang.org/issues/14999?journal_id=735932018-08-18T06:32:45Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p>Eric Wong wrote:</p>
<blockquote>
<p>Committed a version which only avoids switching on r64436<br>
We'll see if CI alerts are quieter, now.</p>
</blockquote>
<p>No good. Now testing:<br>
<a href="https://80x24.org/spew/20180818062657.3424-1-e@80x24.org/raw" class="external">https://80x24.org/spew/20180818062657.3424-1-e@80x24.org/raw</a><br>
(should commit when done testing, but I need food :x)</p>
<p>Also, <a href="https://bugs.ruby-lang.org/issues/15002" class="external">https://bugs.ruby-lang.org/issues/15002</a> should<br>
fix spurious wakeups from ConditionVariable#wait which<br>
might cause some spec failures under MJIT.</p> Ruby master - Bug #14999: ConditionVariable doesn't reacquire the Mutex if Thread#kill-edhttps://bugs.ruby-lang.org/issues/14999?journal_id=735982018-08-18T11:05:03ZEregon (Benoit Daloze)
<ul></ul><p>What was wrong with the first patch?<br>
It looks good from a quick glance, although indeed it doesn't deal with Mutex_m.<br>
Maybe Thread.handle_interrupt(Object => :on_blocking) (or the equivalent in C) would help?</p>
<blockquote>
<p>No good. Now testing:<br>
<a href="https://80x24.org/spew/20180818062657.3424-1-e@80x24.org/raw" class="external">https://80x24.org/spew/20180818062657.3424-1-e@80x24.org/raw</a></p>
</blockquote>
<p>I don't think that's OK for semantics.<br>
IMHO, anything in the synchronize block should hold the Mutex, otherwise we break users' assumptions.<br>
So even if there is an exception waking ConditionVariable#wait, I believe users expect to own the Mutex during that exception, until getting out of the synchronize.<br>
Otherwise, it's very surprising if the ensure block sometimes has, sometimes doesn't have the Mutex and so can't reliably modify data structures protected by the Mutex.<br>
I also believe this was the behavior on Ruby 2.5 and older, but I might be wrong (at least I couldn't observe the spec to fail in thousands of tries).</p> Ruby master - Bug #14999: ConditionVariable doesn't reacquire the Mutex if Thread#kill-edhttps://bugs.ruby-lang.org/issues/14999?journal_id=735992018-08-18T11:13:06ZEregon (Benoit Daloze)
<ul><li><strong>Status</strong> changed from <i>Closed</i> to <i>Open</i></li></ul> Ruby master - Bug #14999: ConditionVariable doesn't reacquire the Mutex if Thread#kill-edhttps://bugs.ruby-lang.org/issues/14999?journal_id=736002018-08-18T11:42:29Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p><a href="mailto:eregontp@gmail.com" class="email">eregontp@gmail.com</a> wrote:</p>
<blockquote>
<p>What was wrong with the first patch?</p>
</blockquote>
<p>I still saw CI failures with it:</p>
<p><a href="http://ci.rvm.jp/results/trunk_gcc7@silicon-docker/1234097" class="external">http://ci.rvm.jp/results/trunk_gcc7@silicon-docker/1234097</a></p>
<p>In retrospect, it could've also been fixed with [Feature <a class="issue tracker-2 status-5 priority-4 priority-default closed" title="Feature: [PATCH] thread.c (sleep_*): reduce the effect of spurious interrupts (Closed)" href="https://bugs.ruby-lang.org/issues/15002">#15002</a>]</p>
<blockquote>
<p>It looks good from a quick glance, although indeed it doesn't deal with Mutex_m.<br>
Maybe Thread.handle_interrupt(Object => :on_blocking) (or the equivalent in C) would help?</p>
</blockquote>
<p>I considered that, but the problem might stem from blocking in<br>
an unexpected place (because CV#wait is being spuriously woken).</p>
<blockquote>
<blockquote>
<p>No good. Now testing:<br>
<a href="https://80x24.org/spew/20180818062657.3424-1-e@80x24.org/raw" class="external">https://80x24.org/spew/20180818062657.3424-1-e@80x24.org/raw</a></p>
</blockquote>
<p>I don't think that's OK for semantics.<br>
IMHO, anything in the synchronize block should hold the Mutex, otherwise we break users' assumptions.<br>
So even if there is an exception waking ConditionVariable#wait, I believe users expect to own the Mutex during that exception, until getting out of the synchronize.<br>
Otherwise, it's very surprising if the ensure block sometimes has, sometimes doesn't have the Mutex and so can't reliably modify data structures protected by the Mutex.<br>
I also believe this was the behavior on Ruby 2.5 and older, but I might be wrong (at least I couldn't observe the spec to fail in thousands of tries).</p>
</blockquote>
<p>OK, I didn't consider the ensure case. So maybe you can revert<br>
r64441 (I'm going AFK for a while), because there's a chance<br>
r64444 also fixed the issue.</p>
<p>Last alert email I got was 2018-08-18 08:59Z, so it's been a<br>
while since I got an alert (this is from ko1's ruby-alerts<br>
list, I don't know if you're on it).</p> Ruby master - Bug #14999: ConditionVariable doesn't reacquire the Mutex if Thread#kill-edhttps://bugs.ruby-lang.org/issues/14999?journal_id=736012018-08-18T13:50:26ZEregon (Benoit Daloze)
<ul></ul><p>OK, I'll revert r64441 since I think we should preserve the semantics of always being locked inside Mutex#synchronize, even if an exception interrupts ConditionVariable#wait.</p>
<p>Let's keep thinking about how to implement that correctly.<br>
Mutex_m would be nice to have correct too, but I think it's less critical as I suspect very few people use it.</p>
<p>FWIW, we're trying to optimize ConditionVariable in TruffleRuby and it's proving quite tricky to:</p>
<ul>
<li>not miss ConditionVariable#signal / ConditionVariable#broadcast</li>
<li>handles other interrupts in #wait</li>
<li>make sure to reacquire the Mutex when going out of #wait</li>
<li>still handle interrupts/thread switching in the re-acquire</li>
</ul>
<p>Maybe the original Ruby code can be an inspiration here: see lib/thread.rb at r42801 or<br>
<a href="https://github.com/ruby/ruby/blob/324df61eea3e4c107e419ea3c685eaea4da7fd5e/lib/thread.rb#L65-L81" class="external">https://github.com/ruby/ruby/blob/324df61eea3e4c107e419ea3c685eaea4da7fd5e/lib/thread.rb#L65-L81</a></p> Ruby master - Bug #14999: ConditionVariable doesn't reacquire the Mutex if Thread#kill-edhttps://bugs.ruby-lang.org/issues/14999?journal_id=736022018-08-18T13:54:09ZEregon (Benoit Daloze)
<ul><li><strong>Status</strong> changed from <i>Open</i> to <i>Closed</i></li></ul><p>Applied in changeset trunk|r64448.</p>
<hr>
<p>Revert r64441</p>
<ul>
<li>This reverts commit 647fc1227a4146ecbfeb0d59358abc8d99cd8ae6:<br>
"thread_sync.c (rb_mutex_synchronize): only unlock if we own the mutex"</li>
<li>Let's try to preserve the semantics of always being locked inside<br>
Mutex#synchronize, even if an exception interrupts ConditionVariable#wait.</li>
<li>As discussed on [Bug <a class="issue tracker-1 status-5 priority-4 priority-default closed" title="Bug: ConditionVariable doesn't reacquire the Mutex if Thread#kill-ed (Closed)" href="https://bugs.ruby-lang.org/issues/14999">#14999</a>].</li>
</ul> Ruby master - Bug #14999: ConditionVariable doesn't reacquire the Mutex if Thread#kill-edhttps://bugs.ruby-lang.org/issues/14999?journal_id=736032018-08-18T13:54:47ZEregon (Benoit Daloze)
<ul><li><strong>Status</strong> changed from <i>Closed</i> to <i>Open</i></li></ul> Ruby master - Bug #14999: ConditionVariable doesn't reacquire the Mutex if Thread#kill-edhttps://bugs.ruby-lang.org/issues/14999?journal_id=736042018-08-18T15:56:31ZEregon (Benoit Daloze)
<ul></ul><p>It failed for <a href="http://ci.rvm.jp/results/trunk_gcc6@silicon-docker/1235550" class="external">http://ci.rvm.jp/results/trunk_gcc6@silicon-docker/1235550</a></p> Ruby master - Bug #14999: ConditionVariable doesn't reacquire the Mutex if Thread#kill-edhttps://bugs.ruby-lang.org/issues/14999?journal_id=736082018-08-18T19:35:25ZEregon (Benoit Daloze)
<ul><li><strong>Status</strong> changed from <i>Open</i> to <i>Closed</i></li></ul><p>Applied in changeset commit:ruby-git|c742050ea5fd30108f913383c0fafc4614adb04c.</p>
<hr>
<p>Revert r64441</p>
<ul>
<li>This reverts commit 647fc1227a4146ecbfeb0d59358abc8d99cd8ae6:<br>
"thread_sync.c (rb_mutex_synchronize): only unlock if we own the mutex"</li>
<li>Let's try to preserve the semantics of always being locked inside<br>
Mutex#synchronize, even if an exception interrupts ConditionVariable#wait.</li>
<li>As discussed on [Bug <a class="issue tracker-1 status-5 priority-4 priority-default closed" title="Bug: ConditionVariable doesn't reacquire the Mutex if Thread#kill-ed (Closed)" href="https://bugs.ruby-lang.org/issues/14999">#14999</a>].</li>
</ul>
<p>git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@64448 b2dd03c8-39d4-4d8f-98ff-823fe69b080e</p> Ruby master - Bug #14999: ConditionVariable doesn't reacquire the Mutex if Thread#kill-edhttps://bugs.ruby-lang.org/issues/14999?journal_id=736092018-08-18T21:22:41Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p><a href="mailto:eregontp@gmail.com" class="email">eregontp@gmail.com</a> wrote:</p>
<blockquote>
<p>It failed for <a href="http://ci.rvm.jp/results/trunk_gcc6@silicon-docker/1235550" class="external">http://ci.rvm.jp/results/trunk_gcc6@silicon-docker/1235550</a></p>
</blockquote>
<p>Maybe wishful thinking, but r64464 might be the right fix.</p>
<p>Time for Me#sleep</p> Ruby master - Bug #14999: ConditionVariable doesn't reacquire the Mutex if Thread#kill-edhttps://bugs.ruby-lang.org/issues/14999?journal_id=736102018-08-19T00:13:20ZEregon (Benoit Daloze)
<ul></ul><p>normalperson (Eric Wong) wrote:</p>
<blockquote>
<p>Maybe wishful thinking, but r64464 might be the right fix.</p>
</blockquote>
<p>Seems it's not yet fully fixed unfortunately:<br>
<a href="http://ci.rvm.jp/results/trunk_clang_39@silicon-docker/1236189" class="external">http://ci.rvm.jp/results/trunk_clang_39@silicon-docker/1236189</a><br>
<a href="http://ci.rvm.jp/results/trunk_gcc5@silicon-docker/1236243" class="external">http://ci.rvm.jp/results/trunk_gcc5@silicon-docker/1236243</a><br>
<a href="http://ci.rvm.jp/results/trunk-asserts-nopara@silicon-docker/1236250" class="external">http://ci.rvm.jp/results/trunk-asserts-nopara@silicon-docker/1236250</a><br>
<a href="http://ci.rvm.jp/results/trunk_gcc5@silicon-docker/1236262" class="external">http://ci.rvm.jp/results/trunk_gcc5@silicon-docker/1236262</a><br>
<a href="http://ci.rvm.jp/results/trunk-mjit-wait@silicon-docker/1236304" class="external">http://ci.rvm.jp/results/trunk-mjit-wait@silicon-docker/1236304</a></p>
<p>I think we need to give some thinking on this,<br>
and I don't want to stress you to fix it to fix the build.<br>
(although this should be solved soon IMHO, latest before the next preview/RC)</p>
<p>So I propose to "tag" the spec so it won't run in CI, so you can focus on it when there is time.<br>
I'll do that tomorrow if you agree.</p>
<p>It would be also be interesting to see if Ruby <= 2.5 actually ensured the semantics I described above.</p> Ruby master - Bug #14999: ConditionVariable doesn't reacquire the Mutex if Thread#kill-edhttps://bugs.ruby-lang.org/issues/14999?journal_id=736132018-08-19T10:04:06Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p><a href="mailto:eregontp@gmail.com" class="email">eregontp@gmail.com</a> wrote:</p>
<blockquote>
<p>I think we need to give some thinking on this,<br>
and I don't want to stress you to fix it to fix the build.<br>
(although this should be solved soon IMHO, latest before the next preview/RC)</p>
</blockquote>
<p>It needs to be fixed ASAP.</p>
<p>r64467 seems to make CI happy, at least (along with r64398,<br>
which was an absolutely necessary fix). The key difference with<br>
r64467 fix is I figured out I needed to reproduce the failure<br>
reliably by using a single CPU (schedtool -a 0x1 -e ...)</p>
<blockquote>
<p>So I propose to "tag" the spec so it won't run in CI, so you can focus on it when there is time.<br>
I'll do that tomorrow if you agree.</p>
</blockquote>
<p>Please don't. CI is happy at the moment, at least.</p>
<blockquote>
<p>It would be also be interesting to see if Ruby <= 2.5 actually<br>
ensured the semantics I described above.</p>
</blockquote>
<p>I was able to reproduce the problem with the old<br>
timer-thread using "schedtool -a 0x1".</p>
<p>"git bisect" points to r63498 ("thread_pthread.c: enable thread<br>
cache by default"), which is HIGHLY unexpected.</p>
<p>I'm not seeing how thread caching can be the cause of the bug,<br>
but rather it makes an existing bug more apparent by being<br>
faster and hitting the race sooner.</p>
<p>The thread cache only operates at the pthreads level, so there's<br>
no way any Ruby-level interrupt could hit across Ruby Thread<br>
lifetimes. pthread_kill is never used in the<br>
Mutex/ConditionVariable code paths; and different pthreads<br>
condvars are used for sleeping/wakeups. So it is a mystery as<br>
to how thread caching ends up affecting Ruby-level interrupt<br>
handling...</p>
<p>What thread-caching does do is make the MSPECOPT "-R" option<br>
noticeably faster</p>
<p>Fwiw, I can still reproduce the failures with low timeouts:</p>
<pre><code>diff --git a/thread_pthread.c b/thread_pthread.c
index 2fd60ddd4a..e8da3ee9c2 100644
--- a/thread_pthread.c
+++ b/thread_pthread.c
@@ -187,7 +187,7 @@ do_gvl_timer(rb_vm_t *vm, rb_thread_t *th)
if (vm->gvl.timer_err == ETIMEDOUT) {
ts.tv_sec = 0;
- ts.tv_nsec = TIME_QUANTUM_NSEC;
+ ts.tv_nsec = 1;
ts = native_cond_timeout(&nd->cond.gvlq, ts);
}
vm->gvl.timer = th;
</code></pre>
<p>So there's definitely more to investigate until I'm satisified<br>
with the reliability of this (but I'm too tired, now).</p> Ruby master - Bug #14999: ConditionVariable doesn't reacquire the Mutex if Thread#kill-edhttps://bugs.ruby-lang.org/issues/14999?journal_id=736142018-08-19T19:52:29Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p>Eric Wong wrote:</p>
<blockquote>
<p>Fwiw, I can still reproduce the failures with low timeouts:</p>
<pre><code>diff --git a/thread_pthread.c b/thread_pthread.c
index 2fd60ddd4a..e8da3ee9c2 100644
--- a/thread_pthread.c
+++ b/thread_pthread.c
@@ -187,7 +187,7 @@ do_gvl_timer(rb_vm_t *vm, rb_thread_t *th)
if (vm->gvl.timer_err == ETIMEDOUT) {
ts.tv_sec = 0;
- ts.tv_nsec = TIME_QUANTUM_NSEC;
+ ts.tv_nsec = 1;
ts = native_cond_timeout(&nd->cond.gvlq, ts);
}
vm->gvl.timer = th;
</code></pre>
<p>So there's definitely more to investigate until I'm satisified<br>
with the reliability of this (but I'm too tired, now).</p>
</blockquote>
<p>I think the only way to work reliably is to disable interrupt<br>
checking when attempting to lock a mutex in ensure:</p>
<p><a href="https://80x24.org/spew/20180819193954.13807-1-e@80x24.org/raw" class="external">https://80x24.org/spew/20180819193954.13807-1-e@80x24.org/raw</a></p>
<p>We can rely on deadlock checking to prevent problems; but it<br>
could still affect user-experience if Ctrl-C takes a while<br>
<em>shrug</em></p>
<p>Currently testing, will take a while.</p> Ruby master - Bug #14999: ConditionVariable doesn't reacquire the Mutex if Thread#kill-edhttps://bugs.ruby-lang.org/issues/14999?journal_id=736152018-08-19T21:42:53ZEregon (Benoit Daloze)
<ul></ul><p>normalperson (Eric Wong) wrote:</p>
<blockquote>
<p>r64467 seems to make CI happy, at least</p>
</blockquote>
<p>Great!</p>
<blockquote>
<p>I figured out I needed to reproduce the failure<br>
reliably by using a single CPU (schedtool -a 0x1 -e ...)</p>
</blockquote>
<p>I could also verify it fails before your commit, and passes after with:</p>
<pre><code>taskset -c 1 mspec-run -R1000 library/conditionvariable/wait_spec.rb
</code></pre>
<blockquote>
<p>"git bisect" points to r63498 ("thread_pthread.c: enable thread<br>
cache by default"), which is HIGHLY unexpected.</p>
</blockquote>
<p>So it seems it doesn't happen in Ruby <= 2.5 due to thread caching being disabled then?</p>
<blockquote>
<p>I'm not seeing how thread caching can be the cause of the bug,<br>
but rather it makes an existing bug more apparent by being<br>
faster and hitting the race sooner.</p>
</blockquote>
<p>No idea either, but I guess a bug could lurk in there.<br>
I did experience a few weird bugs when doing thread caching in TruffleRuby<br>
(e.g., inline caches based on pthread_self() are incorrect as multiple Ruby threads could use the same pthread thread)</p>
<blockquote>
<p>What thread-caching does do is make the MSPECOPT "-R" option<br>
noticeably faster</p>
</blockquote>
<p>Right, because creating threads is so much faster.<br>
Maybe the fact pthread startup is done each time has some effect on races<br>
(e.g., the pthread thread has a perfect view of the caller thread when starting,<br>
but not true with caching, but maybe the GVL still ensure that)</p>
<blockquote>
<p>I think the only way to work reliably is to disable interrupt<br>
checking when attempting to lock a mutex in ensure:</p>
</blockquote>
<p>At least if the interrupt causes to escape the function (e.g., Thread#raise/Thread#kill),<br>
then that escape must be delayed after re-acquiring the Mutex.</p>
<p>In normal Mutex#lock and Mutex#synchronize, it's fine to raise/kill before taking the Mutex,<br>
because it's before entering the critical section, but unfortunately here we are inside the critical section.</p>
<blockquote>
<p>it could still affect user-experience if Ctrl-C takes a while</p>
</blockquote>
<p>When would that happen?<br>
If some other Thread holds the Mutex too long?<br>
It's probably never a good idea to hold a Mutex for seconds or a long time, especially when using a ConditionVariable.<br>
It seems fair enough to delay the interrupt in such a case, as the main Thread is waiting for the Mutex.<br>
Maybe we could warn if re-acquiring takes too long and it becomes a problem in practice (I think it won't).</p> Ruby master - Bug #14999: ConditionVariable doesn't reacquire the Mutex if Thread#kill-edhttps://bugs.ruby-lang.org/issues/14999?journal_id=736162018-08-19T23:12:34Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p><a href="mailto:eregontp@gmail.com" class="email">eregontp@gmail.com</a> wrote:</p>
<blockquote>
<p>normalperson (Eric Wong) wrote:</p>
<blockquote>
<p>"git bisect" points to r63498 ("thread_pthread.c: enable<br>
thread cache by default"), which is HIGHLY unexpected.</p>
</blockquote>
<p>So it seems it doesn't happen in Ruby <= 2.5 due to thread<br>
caching being disabled then?</p>
</blockquote>
<p>Right; but that might just be "luck"...</p>
<blockquote>
<blockquote>
<p>I'm not seeing how thread caching can be the cause of the<br>
bug, but rather it makes an existing bug more apparent by<br>
being faster and hitting the race sooner.</p>
</blockquote>
<p>No idea either, but I guess a bug could lurk in there. I did<br>
experience a few weird bugs when doing thread caching in<br>
TruffleRuby (e.g., inline caches based on pthread_self() are<br>
incorrect as multiple Ruby threads could use the same pthread<br>
thread)</p>
</blockquote>
<p>At least for the core VM and standard library; I don't think<br>
pthread_self is a factor; and we already clobber our TSD<br>
pointer since r63836</p>
<blockquote>
<blockquote>
<p>What thread-caching does do is make the MSPECOPT "-R"<br>
option noticeably faster</p>
</blockquote>
<p>Right, because creating threads is so much faster. Maybe the<br>
fact pthread startup is done each time has some effect on<br>
races (e.g., the pthread thread has a perfect view of the<br>
caller thread when starting, but not true with caching, but<br>
maybe the GVL still ensure that)</p>
</blockquote>
<p>Yes, it might affect accounting done in the kernel; particularly<br>
when bound to a single CPU. And it still takes -R1000 to<br>
reliably reproduce the problem with a single CPU; -R500 was<br>
sometimes successful, even.</p>
<blockquote>
<blockquote>
<p>I think the only way to work reliably is to disable<br>
interrupt checking when attempting to lock a mutex in<br>
ensure:</p>
</blockquote>
<p>At least if the interrupt causes to escape the function (e.g.,<br>
Thread#raise/Thread#kill), then that escape must be delayed<br>
after re-acquiring the Mutex.</p>
</blockquote>
<p>Yes, so I've committed r64476</p>
<blockquote>
<p>In normal Mutex#lock and Mutex#synchronize, it's fine to<br>
raise/kill before taking the Mutex, because it's before<br>
entering the critical section, but unfortunately here we are<br>
inside the critical section.</p>
</blockquote>
<p>So this is why pthread_mutex_lock is forbidden from returning<br>
EINTR by POSIX.</p>
<p>Mutex#lock/synchronize should have been uninterruptible, too;<br>
but it's too late to change that as it would break compatibility<br>
with existing code (bootstraptest/test_thread.rb already breaks).</p>
<blockquote>
<blockquote>
<p>it could still affect user-experience if Ctrl-C takes a while</p>
</blockquote>
<p>When would that happen?<br>
If some other Thread holds the Mutex too long?</p>
</blockquote>
<p>Yes, but I suppose it's not a real problem.</p>
<blockquote>
<p>It's probably never a good idea to hold a Mutex for seconds or<br>
a long time, especially when using a ConditionVariable.<br>
It seems fair enough to delay the interrupt in such a case, as<br>
the main Thread is waiting for the Mutex.</p>
</blockquote>
<p>Agreed(*).</p>
<blockquote>
<p>Maybe we could warn if re-acquiring takes too long and it<br>
becomes a problem in practice (I think it won't).</p>
</blockquote>
<p>We already have deadlock checking to alert users to real<br>
problems, so it's probably not a problem in practice.</p>
<p>(*) Along those lines, I think the "lock" idiom for GVL is not<br>
ideal for performance (kosaki rewrote the GVL for 1.9.3 to<br>
optimize for contention as a result). I might try replacing<br>
the GVL with a platform-independent scheduler which limits<br>
parallelism to the data structures the GVL currently protects.</p>
<p>Note: This will NOT solve the parallelism problem which<br>
exists in C Ruby; that is a much bigger task (but still<br>
doable with a scheduler, and perhaps even more so).</p> Ruby master - Bug #14999: ConditionVariable doesn't reacquire the Mutex if Thread#kill-edhttps://bugs.ruby-lang.org/issues/14999?journal_id=736272018-08-20T13:30:44ZEregon (Benoit Daloze)
<ul></ul><p>normalperson (Eric Wong) wrote:</p>
<blockquote>
<p>(*) Along those lines, I think the "lock" idiom for GVL is not<br>
ideal for performance (kosaki rewrote the GVL for 1.9.3 to<br>
optimize for contention as a result). I might try replacing<br>
the GVL with a platform-independent scheduler which limits<br>
parallelism to the data structures the GVL currently protects.</p>
</blockquote>
<p>I don't follow, which data structures would the scheduler protect?<br>
Internal VM data structures but not e.g. Ruby Hash?</p>
<blockquote>
<p>Note: This will NOT solve the parallelism problem which<br>
exists in C Ruby; that is a much bigger task (but still<br>
doable with a scheduler, and perhaps even more so).</p>
</blockquote>
<p>Now I am curious, how would a scheduler solve the<br>
lack of parallelism at the Ruby level in CRuby?</p>
<p>I'm particularly interested since I looked at concurrency in Ruby<br>
and specifically TruffleRuby during my PhD, and how to have GIL-like<br>
guarantees and yet allow to run Ruby code in parallel.</p> Ruby master - Bug #14999: ConditionVariable doesn't reacquire the Mutex if Thread#kill-edhttps://bugs.ruby-lang.org/issues/14999?journal_id=736332018-08-20T20:12:27Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p><a href="mailto:eregontp@gmail.com" class="email">eregontp@gmail.com</a> wrote:</p>
<blockquote>
<p>normalperson (Eric Wong) wrote:</p>
<blockquote>
<p>(*) Along those lines, I think the "lock" idiom for GVL is not<br>
ideal for performance (kosaki rewrote the GVL for 1.9.3 to<br>
optimize for contention as a result). I might try replacing<br>
the GVL with a platform-independent scheduler which limits<br>
parallelism to the data structures the GVL currently protects.</p>
</blockquote>
<p>I don't follow, which data structures would the scheduler protect?<br>
Internal VM data structures but not e.g. Ruby Hash?</p>
</blockquote>
<p>Everything the GVL currently protects, including the Ruby Hash.</p>
<blockquote>
<blockquote>
<p>Note: This will NOT solve the parallelism problem which<br>
exists in C Ruby; that is a much bigger task (but still<br>
doable with a scheduler, and perhaps even more so).</p>
</blockquote>
<p>Now I am curious, how would a scheduler solve the<br>
lack of parallelism at the Ruby level in CRuby?</p>
</blockquote>
<p>It might allow cheaper entry/exit for blocking regions;<br>
so we can use more blocking regions to achieve parallelism.</p>
<p>GVL release/acquire is expensive at the moment; so there's<br>
places where we hold the GVL (e.g. readdir calls) and sacrifice<br>
parallelism when dealing with slow filesystems.</p>
<p>Also, pthread_cond_wait is just like ConditionVariable#wait in<br>
that it must re-lock the mutex after sleeping. This re-lock<br>
can be contended, and it should not be necessary if we move<br>
away from condvars for some platforms (Linux and win32).<br>
The signaller (FUTEX_WAKE/SetEvent) can remove from the waitqueue<br>
while it's holding the lock, so the sleeper won't have to.</p>
<p>Anyways, just a small idea in my head and totally unproven.</p> Ruby master - Bug #14999: ConditionVariable doesn't reacquire the Mutex if Thread#kill-edhttps://bugs.ruby-lang.org/issues/14999?journal_id=738912018-09-05T07:05:32Zlarskanis (Lars Kanis)
<ul></ul><blockquote>
<blockquote>
<p>So it seems it doesn't happen in Ruby <= 2.5 due to thread<br>
caching being disabled then?</p>
</blockquote>
<p>Right; but that might just be "luck"...</p>
</blockquote>
<p>I get this error on ruby-2.4.4 [i386-mingw32] and [i386-mingw32] while nightly builds of RubyInstaller: <a href="https://ci.appveyor.com/project/larskanis/rubyinstaller2-hbuor/build/1.0.722/job/dva7ug4vako5jhrl#L1295" class="external">https://ci.appveyor.com/project/larskanis/rubyinstaller2-hbuor/build/1.0.722/job/dva7ug4vako5jhrl#L1295</a><br>
It appears with a probability of around 50%.</p> Ruby master - Bug #14999: ConditionVariable doesn't reacquire the Mutex if Thread#kill-edhttps://bugs.ruby-lang.org/issues/14999?journal_id=738992018-09-05T12:18:13ZEregon (Benoit Daloze)
<ul></ul><p><a class="user active user-mention" href="https://bugs.ruby-lang.org/users/5550">@larskanis (Lars Kanis)</a> Thanks for the report.<br>
I also noticed it happens on RubyCI for Ruby 2.4 and earlier<br>
(example: <a href="https://rubyci.org/logs/rubyci.s3.amazonaws.com/arch/ruby-2.4/log/20180905T105615Z.fail.html.gz" class="external">https://rubyci.org/logs/rubyci.s3.amazonaws.com/arch/ruby-2.4/log/20180905T105615Z.fail.html.gz</a>)</p>
<p>I'm not sure what's best.</p>
<p><a class="user active user-mention" href="https://bugs.ruby-lang.org/users/724">@normalperson (Eric Wong)</a> Do you think we could backport the fix (or the logic behind the fix since the code changed quite a bit) to Ruby 2.4 and 2.3?<br>
Interestingly enough, it seems to not fail on Ruby 2.5 at least in recent RubyCI builds.</p>
<p>For now, I'll mark the spec as ruby_bug on Ruby < 2.5.</p> Ruby master - Bug #14999: ConditionVariable doesn't reacquire the Mutex if Thread#kill-edhttps://bugs.ruby-lang.org/issues/14999?journal_id=739282018-09-06T17:42:38Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p><a href="mailto:lars@greiz-reinsdorf.de" class="email">lars@greiz-reinsdorf.de</a> wrote:</p>
<blockquote>
<p>Issue <a class="issue tracker-1 status-5 priority-4 priority-default closed" title="Bug: ConditionVariable doesn't reacquire the Mutex if Thread#kill-ed (Closed)" href="https://bugs.ruby-lang.org/issues/14999">#14999</a> has been updated by larskanis (Lars Kanis).</p>
<blockquote>
<blockquote>
<p>So it seems it doesn't happen in Ruby <= 2.5 due to thread<br>
caching being disabled then?</p>
</blockquote>
<p>Right; but that might just be "luck"...</p>
</blockquote>
<p>I get this error on ruby-2.4.4 [i386-mingw32] and [i386-mingw32] while nightly builds of RubyInstaller: <a href="https://ci.appveyor.com/project/larskanis/rubyinstaller2-hbuor/build/1.0.722/job/dva7ug4vako5jhrl#L1295" class="external">https://ci.appveyor.com/project/larskanis/rubyinstaller2-hbuor/build/1.0.722/job/dva7ug4vako5jhrl#L1295</a><br>
It appears with a probability of around 50%.</p>
</blockquote>
<p>Thanks for the reminder (this bugfix already feels like a year ago):</p>
<p>Backport requested: <a href="https://bugs.ruby-lang.org/issues/15084" class="external">https://bugs.ruby-lang.org/issues/15084</a></p>