Project

General

Profile

Backport #3212

ConditionVariable may become inconsistent for interrupted threads

Added by sylvain.joyeux (Sylvain Joyeux) about 9 years ago. Updated over 6 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
[ruby-core:29835]

Description

=begin
While tracking another issue with ConditionVariable, I realized that ConditionVariable#wait assumes that the waking-up side (#signal and #broadcast) will remove the thread from the list of waiters.

Unfortunately, Mutex#sleep may return right away if an interrupt is set for the thread, in which case the thread will stay in the condition variable's waiting list.
=end

History

#1

Updated by mame (Yusuke Endoh) about 9 years ago

  • Status changed from Open to Rejected

=begin
Hi,

2010/4/28 Sylvain Joyeux redmine@ruby-lang.org:

While tracking another issue with ConditionVariable, I realized that ConditionVariable#wait assumes that the waking-up side (#signal and #broadcast) will remove the thread from the list of waiters.

Unfortunately, Mutex#sleep may return right away if an interrupt is set for the thread, in which case the thread will stay in the condition variable's waiting list.

I believe the behavior does not cause actual problem as long
as you use ConditionVarialbe correctly.

It is a spec that ConditionVariable may return even if it is
not signaled. You MUST use ConditionVariable#wait in until-
loop to check your condition, like:

mutex = Mutex.new
cv = ConditionVariable.new

a = Thread.start do
mutex.synchronize do
...
until (... "your condition" ...)
cv.wait(mutex)
end
...
end
end

b = Thread.start do
mutex.synchronize do
...
# operation to change "your condition"
...
cv.signal
end
end

In this usage, the waiting thread does not escape from until-
loop until the ConditionVariable is signaled.

This clumsy API came from C's condition variable (pthread_cond_
wait or so). I guess the API may be improved to be handy, but
it is feature request.

So this ticket is false. I close it.

If you think you are using CondtionVariable correctly, could
you elaborate the problem you are facing?
There may be another scenario of bug that I cannot think of.

--
Yusuke Endoh mame@tsg.ne.jp
=end

#2

Updated by sylvain.joyeux (Sylvain Joyeux) about 9 years ago

=begin
This is not only completely wrong, it also goes against any multi-threading API that exists out there.

pthread_cond_wait can not return in any other case than the condition variable being signaled. The only other way would be thread cancellation, in which case, technically, pthread_cond_wait does not return: the thread is destroyed.

I see now that Java does have this semantic. This is unfortunate (and very bad), but I guess I'll have to deal with it.
=end

#3

Updated by sylvain.joyeux (Sylvain Joyeux) about 9 years ago

=begin
Update: I did realize that some APIs do behave that way already.

(Disclaimer: next sentence is kind-of a rant). I guess it is only disturbing for users of sane multi-threading APIs.
=end

#4

Updated by sylvain.joyeux (Sylvain Joyeux) about 9 years ago

=begin
Yusuke's answer and mine apply to #3213, not to this bug.

This bug is a different problem. In Yusuke's example, the following happens if 'a' gets a spurious wakeup:

The CV's internal @waiters array still contains 'a'. Then, the next call to #wait() will add it again (@waiters contains 'a' two times). When a "real" wakeup happens, @waiters still has 'a' in it and the next call to #signal or #broadcast will wake 'a' up even if 'a' does not wait on the condition variable.

IMO, #wait should ensure that the current thread has been removed from the waiter's list:

def wait(mutex)
t = Thread.current
@waiters_mutex.synchronize do
@waiters.push(t)
end
mutex.sleep
rescue
@waiters_mutex.synchronize do
@waiters.delete(t)
end
end

=end

#5

Updated by mame (Yusuke Endoh) about 9 years ago

=begin
Hi,

2010/5/3 Sylvain Joyeux redmine@ruby-lang.org:

pthread_cond_wait can not return in any other case than the condition variable being signaled. The only other way would be thread cancellation, in which case, technically, pthread_cond_wait does not return: the thread is destroyed.

http://www.opengroup.org/onlinepubs/009695399/functions/pthread_cond_wait.html

Spurious wakeups from the pthread_cond_timedwait() or
pthread_cond_wait() functions may occur.

--
Yusuke Endoh mame@tsg.ne.jp

=end

#6

Updated by mame (Yusuke Endoh) about 9 years ago

=begin
Hi,

2010/5/3 Sylvain Joyeux redmine@ruby-lang.org:

Yusuke's answer and mine apply to #3213, not to this bug.

This bug is a different problem. In Yusuke's example, the following happens if 'a' gets a spurious wakeup:

The CV's internal @waiters array still contains 'a'. Then, the next call to #wait() will add it again (@waiters contains 'a' two times). When a "real" wakeup happens, @waiters still has 'a' in it and the next call to #signal or #broadcast will wake 'a' up even if 'a' does not wait on the condition variable.

No, 'a' does wait on the CV as long as you are using CV#wait
correctly.

1) Thread 'a' gets a spurious wakeup
2) Thread 'a' evaluates your condition (predicate) and figures out
it false (because the CV has not been signaled formally by 'b')
3) Thread 'a' waits the CV again

4) Thread 'b' call CV#signal which makes wake 'a' up

#4 cannot occur between #1 and #3 because the corresponding mutex
should be locked when calling CV#signal.

If you have any concrete problem, please elaborate it.

--
Yusuke Endoh mame@tsg.ne.jp

=end

#7

Updated by sylvain.joyeux (Sylvain Joyeux) about 9 years ago

=begin
This is not about having spurious wakeups or not (I understood that my knowledge of pthreads is lower than yours, and I apologize for the noise). This is about maintaining class invariants, in this case: "threads that are listed in @waiters should be the threads waiting for the CV".

A few scenarios:

Scenario 1


1) Thread 'a' gets a spurious wakeup
2) Thread 'a' gets back to wait() because it is in an until loop. @waiters contains 'a' two times.
3) Thread 'b' wakes 'a' up
4) Thread 'a' leaves the synchronization section. 'a' is still listed in @waiters (the bug I am reporting)
5) Later on, 'a' goes back to sleep (for whatever reason)
6) Thread 'b' calls CV#signal and wakes up 'a', since 'a' is still in @waiters

Scenario 2


1) Thread 'a' gets a spurious wakeup
2) Thread 'a' gets back to sleep *not using CV#wait" (using mutex#sleep or Thread.sleep, for whatever reason)
3) Thread 'b' calls CV#signal, which wakes 'a' up since 'a' is still in the @waiters list.

In (3), 'a' was not waiting on the CV anymore. But 'b' called a#run anyway.

Scenario 3


This is a modified version of Scenario (1)

1) Thread 'a' gets a spurious wakeup
2) Thread 'a' gets back to wait() because it is in an until loop. @waiters contains 'a' two times.
3) Thread 'b' wakes 'a' up
4) Thread 'a' leaves the synchronization section. 'a' is still listed in @waiters (the bug I am reporting)
5) Thread 'b' calls CV#signal and wakes up 'a', which does not wait for the CV anymore

I.e. CV#signal in (5) does not wake up anybody, which most likely will lead to a deadlock

=end

#8

Updated by mame (Yusuke Endoh) about 9 years ago

  • Status changed from Rejected to Open

=begin
Hi,

Scenario 1

1) Thread 'a' gets a spurious wakeup
2) Thread 'a' gets back to wait() because it is in an until loop. @waiters contains 'a' two times.
3) Thread 'b' wakes 'a' up
4) Thread 'a' leaves the synchronization section. 'a' is still listed in @waiters (the bug I am reporting)
5) Later on, 'a' goes back to sleep (for whatever reason)
6) Thread 'b' calls CV#signal and wakes up 'a', since 'a' is still in @waiters

You are right! I never thought of this scenario.
I think CV#wait should add the thread to @waiter unless it is not
contained yet.

Scenario 2

1) Thread 'a' gets a spurious wakeup
2) Thread 'a' gets back to sleep *not using CV#wait" (using mutex#sleep or Thread.sleep, for whatever reason)
3) Thread 'b' calls CV#signal, which wakes 'a' up since 'a' is still in the @waiters list.

#2 of this scenario is impossible, because the corresponding
predicate is still false. By the while loop, Thread 'a' must get
back to CV#wait. But I doubt if I'm right.

reproducing script:
require "thread"

m = Mutex.new
cv = ConditionVariable.new
flag = false

# three waiting threads
a, b, c = (0..2).map do |i|
Thread.new do
m.synchronize do
2.times do
cv.wait(m) until flag
p "Thread #{i} wakeup"
flag = false
end
end
n = sleep 3
p "Thread #{i} wrong wakeup" if n < 3
end
end

# spurious wakeup
sleep 0.5
c.run
sleep 0.5
c.run

# wake the threads up
6.times do
sleep 0.5
m.synchronize { flag = true; cv.signal }
end

a.join
b.join
c.join

expected transcript:
"Thread 0 wakeup"
"Thread 1 wakeup"
"Thread 2 wakeup"
"Thread 0 wakeup"
"Thread 1 wakeup"
"Thread 2 wakeup"

actual transcript:
"Thread 0 wakeup"
"Thread 1 wakeup"
"Thread 2 wakeup"
"Thread 2 wakeup"
"Thread 2 wrong wakeup"
"Thread 0 wakeup"
t.rb:35:in join': deadlock detected (fatal)
from t.rb:35:in
'

diff --git a/lib/thread.rb b/lib/thread.rb
index f3831a7..d3b789b 100644
--- a/lib/thread.rb
+++ b/lib/thread.rb
@@ -66,7 +66,7 @@ class ConditionVariable
begin
# TODO: mutex should not be used
@waiters_mutex.synchronize do

  • @waiters.push(Thread.current)
  • @waiters.push(Thread.current) unless @waiters.include?(Thread.current) end mutex.sleep timeout end

--
Yusuke Endoh mame@tsg.ne.jp
=end

#9

Updated by sylvain.joyeux (Sylvain Joyeux) about 9 years ago

=begin
Doing that is not enough. CV should definitely maintain @waiter so that no thread that is not in #wait is not included in it.

Another example: using Thread#raise. Your fix will not work, since we don't loop.

=end

#10

Updated by sylvain.joyeux (Sylvain Joyeux) about 9 years ago

=begin
Example script for my last update

cv = ConditionVariable.new
mutex = Mutex.new

other_t = Thread.new do
mutex.synchronize do
begin
loop do
STDERR.puts "waiting"
cv.wait(mutex)
end
ensure
waiters = cv.instance_variable_get(:@waiters)
if !waiters.empty?
STDERR.puts "@waiters is not empty"
end
end
end
end

sleep 1
other_t.raise Interrupt if other_t
other_t.join

Displays "@waiters not empty"
=end

#11

Updated by mame (Yusuke Endoh) about 9 years ago

=begin
Hi,

2010/5/5 Sylvain Joyeux redmine@ruby-lang.org:

Doing that is not enough. CV should definitely maintain @waiter so that no thread that is not in #wait is not included in it.

Another example: using Thread#raise. Your fix will not work, since we don't loop.

Ruby's ConditionVariable API obviously copies C's.
So an exception is not considered.

If you use an exception, you have to care it yourself, such as,
by adding rescue clause:

a = Thread.start do
mutex.synchronize do
...
exc = nil
until enc || (... "your condition" ...)
begin
cv.wait(mutex)
rescue Exception
exc = $!
end
end
raise exc if exc
...
end
end

ConditionVariable API design is not Ruby-way at all.
It can be more elegant, though it is very difficult to design.
Anyway, it is a feature request.

At heart, I agree with you. But the fix you suggested in
[ruby-core:29961] adds new critical section. It looks benign,
but I'm not sure it is truly benign. I'm afraid whether it has
unexpected harmful effects, such as deadlock or big performance
degradation.
Now is under release management, so I don't want to accept such
a fix without assurance.

Could anyone who is familiar with multi-thread programming check
the fix in [ruby-core:29961] and assure it has no problem?

--
Yusuke Endoh mame@tsg.ne.jp

=end

#12

Updated by mame (Yusuke Endoh) about 9 years ago

=begin
Hi,

If you use an exception, you have to care it yourself, such as,
by adding rescue clause:

a = Thread.start do
mutex.synchronize do
...
exc = nil
until enc || (... "your condition" ...)
begin
cv.wait(mutex)
rescue Exception
exc = $!
end
end
raise exc if exc
...
end
end

Sorry, I realized even this code does not work correctly.
Thread 'a' may be left in @waiters after the until loop.
Multi-thread programming is very difficult for me :-(

--
Yusuke Endoh mame@tsg.ne.jp

=end

#13

Updated by sylvain.joyeux (Sylvain Joyeux) about 9 years ago

=begin
Ruby allows condition variables to be interrupted by exceptions, and it should therefore work when it is used. It is a grave bug, as it will lead to weird, hard-to-debug situations for people that use Ruby normal features (exception handling and multi-threading).

I'm really getting tired of having to deal with multi-threading issues with Ruby. If MT is so low-priority in the MRI world, maybe you guys should remove it completely from the interpreter. This situation where "might, might not work, don't know" is really bad and detrimental to Ruby's image. I already have a time hard enough to advocate the usage of ruby vs. Python (python is much more popular in the robotics field), that I would really prefer not having to explain to people that these low-level bugs are not even fixed.

The fix I proposed will cause a competition on the waiters mutex on wakeup. This has an effect only if the CV is not used in the context of an external mutex (which should never be the case). Moreover, it is a very low impact as there is very little operations done while taking this mutex (only modifications to @waiters). In the end, you could have priority inversions problems, but if you start being picky about those, you would have to rewrite the whole of the MRI multi-threading implementation.
=end

#14

Updated by mame (Yusuke Endoh) about 9 years ago

=begin
Hi,

2010/5/5 Sylvain Joyeux redmine@ruby-lang.org:

Ruby allows condition variables to be interrupted by exceptions, and it should therefore work when it is used. It is a grave bug, as it will lead to weird, hard-to-debug situations for people that use Ruby normal features (exception handling and multi-threading).

Umm, Okay. I'll accept your fix unless there is objection.

But, you should not use Thread#raise if you are serious about multi-
thread programming. It has a potential race condition:

1) an exception is raised by another thread
2) rescue/ensure clause is going to be executed
3) another exception is raised by another thread
4) rescue/ensure clause are not executed

An exception raised by another thread cannot be handled safely.

I'm really getting tired of having to deal with multi-threading issues with Ruby.

I understand your irrits, but in fact, Ruby's thread is very immature
in a sense of both design and implementation. This is unfortunate, but
the fact.

If MT is so low-priority in the MRI world, maybe you guys should remove it completely from the interpreter.

I think it is good idea, with no apparent sense of irony, though
it is of course impossible.

you would have to rewrite the whole of the MRI multi-threading implementation.

I think so, honestly.
We need contribution from those who are familiar with multi-thread
programing.

--
Yusuke Endoh mame@tsg.ne.jp

=end

#15

Updated by mame (Yusuke Endoh) about 9 years ago

=begin
Hi,

2010/5/6 Caleb Clausen vikkous@gmail.com:

AFAICT, Silvain's proposed fix is mostly correct. I've made a couple
of minor revisions below: the rescue clause should read 'rescue
Exception' to ensure the cleanup happens on all exceptions, and I've
added a raise at the end so the exception will be re-raised rather
than having a normal return. Silvain, (or Yusuke, or anyone) please do
correct me if either of these is wrong.

BTW, what behavior do you expect when an exception is raised
during ConditionVariable#wait?

Currently, the call to CV#wait may aborted without locking
the corresponding mutex:

require "thread"

m = Mutex.new
cv = ConditionVariable.new

th = Thread.new do
m.synchronize do
begin
cv.wait(m)
ensure
sleep 0.5
p m.locked? #=> false, expceted true
break
end
end
end

sleep 0.5
m.lock
th.raise
th.raise
m.unlock
th.join

This is a serious issue, but if this is fixed to ensure to lock
the mutex by ignoring an exception, the process can not be
stopped by Ctrl+C. It is another serious issue.

I guess there is no answer satisfying everbody, but I think we
should accept the fact that Thread#raise is not in line with CV.

In addition, if we are really serious for using CV, signal must
not be converted to an exception, such as SIGINT and Interrupted.
It can be achieved by Kernel#trap, i.e., trap(:INT) {}, but it
is too stoic to enjoy programming...

Am I the only one who likes semaphores? Silvain, I'm particularly
interested in hearing your opinion. Would a semaphore have suited your
purposes as well as a condvar does?

Agreed. CV is too difficult.

--
Yusuke Endoh mame@tsg.ne.jp

=end

#16

Updated by mame (Yusuke Endoh) about 9 years ago

=begin
Hi,

2010/5/7 Caleb Clausen vikkous@gmail.com:

This is a serious issue, but if this is fixed to ensure to lock
the mutex by ignoring an exception, the process can not be
stopped by Ctrl+C. ?It is another serious issue.

I can't say I entirely understand. Do you mean that ignoring the 2nd
Thread#raise causes ctrl-c to be ignored?

Exactly.

Can a special case for
ctrl-c be made? Is there a more complete discussion of this issue
somewhere?

Sorry, this is discussed in Japanese closed IRC with some committers
including ko1, akr and kosaki. To sum up the discussion, we thought
convenience of Ctrl+C is more important (for casual users) then
consistency of CV#wait.
And, a techie who want consistency of CV#wait can achieve it by
avoiding Thread#raise and by paying attention to signal.

In addition, if we are really serious for using CV, signal must
not be converted to an exception, such as SIGINT and Interrupted.
It can be achieved by Kernel#trap, i.e., trap(:INT) {}, but it
is too stoic to enjoy programming...

All signals are sent to the main thread, aren't they? So at least the
user needs to worry about this only if using a CV in the main thread.

Exactly. You are really quine to understand :-)

--
Yusuke Endoh mame@tsg.ne.jp

=end

#17

Updated by mame (Yusuke Endoh) about 9 years ago

=begin
2010/5/7 Yusuke ENDOH mame@tsg.ne.jp:

Exactly.  You are really quine to understand :-)

quine -> quick

--
Yusuke Endoh mame@tsg.ne.jp

=end

#18

Updated by sylvain.joyeux (Sylvain Joyeux) about 9 years ago

=begin
I'm using Thread#raise in conditions where it is safe, i.e. to wake-up threads that are waiting on CV, in places where I know it makes sense. And I find that much cleaner than having to set a per-thread variable telling the thread that he has to raise its own exception.

As for not locking the mutex on a second exception, it is linked with the fact that Thread#raise should not be used "blindly". I can accept that. I actually make sure that there is no locking in my applications in the path of the main thread (everything happens in separate threads) so that the applications can shutdown cleanly if the user presses Ctrl+C.

I'm not sure what you are referring to with the "consistency of CV#wait vs. CTRL+C" do you refer to the mutex#lock being cancelled by a second exception or to my proposed fix ?

=end

#19

Updated by mame (Yusuke Endoh) about 9 years ago

=begin
Hi,

2010/5/8 Sylvain Joyeux redmine@ruby-lang.org:

I'm not sure what you are referring to with the "consistency of CV#wait vs. CTRL+C" do you refer to the mutex#lock being cancelled by a second exception or to my proposed fix ?

I meant the former.

Your fix will be imported. Please wait a while longer!

--
Yusuke Endoh mame@tsg.ne.jp

=end

Updated by kosaki (Motohiro KOSAKI) over 6 years ago

  • Tracker changed from Bug to Backport
  • Description updated (diff)
  • Status changed from Open to Closed

ConditionVariable @waiter issue was fixed at r38080.

Also available in: Atom PDF