Project

General

Profile

Actions

Bug #19473

open

can't be called from trap context (ThreadError) is too limiting

Added by Eregon (Benoit Daloze) over 2 years ago. Updated 2 days ago.

Status:
Open
Assignee:
-
Target version:
-
ruby -v:
ruby 3.2.1 (2023-02-08 revision 31819e82c8) [x86_64-linux]
[ruby-core:112664]

Description

Simple reproducer:

$ ruby -ve 'm=Mutex.new; trap(:HUP) { m.synchronize { p :OK } }; Process.kill :HUP, Process.pid; sleep 0.1'
ruby 3.2.1 (2023-02-08 revision 31819e82c8) [x86_64-linux]
-e:1:in `synchronize': can't be called from trap context (ThreadError)
	from -e:1:in `block in <main>'
	from -e:1:in `kill'
	from -e:1:in `<main>'

Expected behavior:

$ ruby -ve 'm=Mutex.new; trap(:HUP) { m.synchronize { p :OK } }; Process.kill :HUP, Process.pid; sleep 0.1'
truffleruby 22.3.1, like ruby 3.0.3, GraalVM CE Native [x86_64-linux]
:OK

$ ruby -ve 'm=Mutex.new; trap(:HUP) { m.synchronize { p :OK } }; Process.kill :HUP, Process.pid; sleep 0.1'
jruby 9.4.0.0 (3.1.0) 2022-11-23 95c0ec159f OpenJDK 64-Bit Server VM 17.0.6+10 on 17.0.6+10 +jit [x86_64-linux]
:OK

This exception is highly problematic, for instance it breaks Timeout.timeout in trap:
https://github.com/ruby/timeout/issues/17#issuecomment-1142035939

I suppose this behavior is because sometimes it's problematic to lock a Mutex in trap, e.g., if it's already locked by the main thread/fiber.
But that would otherwise already raise deadlock; recursive locking (ThreadError), so there is no point to fail early.
And that's just one case, not all, so we should not always raise an exception.

There seems to be no valid reason to prevent all Mutex#synchronize in trap.
After all, if the Mutex for instance is only used in trap, it's well-defined AFAIK.
For instance a given trap handler does not seem executed concurrently:

$ ruby -ve 'trap(:HUP) { puts "in trap\n"+caller.join("\n")+"\n\n"; sleep 0.1 }; pid = Process.pid; Process.wait fork { 20.times { Process.kill :HUP, pid } }; sleep 1'
ruby 3.2.1 (2023-02-08 revision 31819e82c8) [x86_64-linux]
in trap
-e:1:in `wait'
-e:1:in `<main>'

in trap
-e:1:in `wait'
-e:1:in `<main>'

in trap
-e:1:in `wait'
-e:1:in `<main>'

in trap
-e:1:in `wait'
-e:1:in `<main>'

in trap
-e:1:in `wait'
-e:1:in `<main>'

in trap
-e:1:in `wait'
-e:1:in `<main>'

And if the trap handler using the Mutex is never called while the Mutex is held by the main thread/fiber, there is also no problem.

Updated by Eregon (Benoit Daloze) over 2 years ago

Also Monitor has the same problem on CRuby:

$ ruby -ve 'm=Monitor.new; trap(:HUP) { m.synchronize { p :OK } }; Process.kill :HUP, Process.pid; sleep 0.1'
ruby 3.2.1 (2023-02-08 revision 31819e82c8) [x86_64-linux]
-e:1:in `synchronize': can't be called from trap context (ThreadError)
	from -e:1:in `block in <main>'
	from -e:1:in `kill'
	from -e:1:in `<main>'

And again it works fine on truffleruby and jruby

And since Monitor is reentrant there is no reason at all (AFAIK) to prevent it in trap.

Actions #2

Updated by Eregon (Benoit Daloze) over 2 years ago

  • ruby -v set to ruby 3.2.1 (2023-02-08 revision 31819e82c8) [x86_64-linux]

Updated by Eregon (Benoit Daloze) over 2 years ago

Another approach would be to do it like e.g. Java which runs signal handlers on a separate thread.
Then there is no potential problem of the signal handler being reentrant to the main thread/fiber.
But that is a much bigger semantic change of course, so I think just removing the can't be called from trap context (ThreadError) is much more actionable.

Actions #4

Updated by Eregon (Benoit Daloze) over 2 years ago

  • Description updated (diff)

Updated by ioquatix (Samuel Williams) over 2 years ago

I was under the impression that very few operations are safe to call from a signal handler. Is that not the case here?

Updated by Eregon (Benoit Daloze) over 2 years ago

ioquatix (Samuel Williams) wrote in #note-5:

I was under the impression that very few operations are safe to call from a signal handler. Is that not the case here?

I think you are thinking about C signal handlers and C async-safety. This is not relevant here, the Ruby signal handler doesn't run inside the C signal handler.

Updated by ko1 (Koichi Sasada) over 2 years ago

The current limitation is introduced to protect user from dead-lock error like that:

m = Mutex.new

trap(:INT){m.synchronizaton{ do_something } }

m.lock; sleep
# C-c here

will cause deadlock and it is hard to predict and coding for it. ( https://github.com/ruby/timeout/issues/17#issuecomment-1461498517 is one example)

Another approach would be to do it like e.g. Java which runs signal handlers on a separate thread.

Yes. There was a discussion to introduce a thread for signal handlers. But it was not introduced maybe because it is too much for many cases.

akr had proposed that if someone want to use such case, how about to make a thread in a trap handler for workaround. I think it is reasonable.

Updated by Eregon (Benoit Daloze) over 2 years ago

Right, like I mentioned above.
But the deadlock might not happen (or even cannot happen for some cases I mentioned above), and then CRuby fails needlessly.
If the deadlock does happen, then we would get this error: deadlock; recursive locking (ThreadError). Which means the program doesn't hang and there is a stacktrace and it shows relevant information which is useful to fix it.

So I think it is just a mistake to fail early with can't be called from trap context (ThreadError).
In fact CRuby even cheats this rule for IO I saw.
There are fully correct use cases for using a Mutex inside trap as I mentioned in the description.

Updated by Eregon (Benoit Daloze) over 2 years ago

https://github.com/ruby/timeout/issues/17#issuecomment-1464039853 seems a proof this restriction is wrong, and actually prevents a correct solution to that issue.

Also this restriction makes no sense for Monitor, since that's reentrant by design.

In fact, one could make their own Mutex using Queue, since Queue is allowed in trap handlers to workaround this restriction. It doesn't make any sense to reimplement Mutex in such an inefficient way of course.

Updated by Eregon (Benoit Daloze) over 2 years ago

ko1 (Koichi Sasada) wrote in #note-7:

Yes. There was a discussion to introduce a thread for signal handlers. But it was not introduced maybe because it is too much for many cases.

I think this is ultimately the proper solution to this problem.
It's extremely hard to reason about a signal handler running on top of any line of code of the main thread.

akr had proposed that if someone want to use such case, how about to make a thread in a trap handler for workaround. I think it is reasonable.

I cannot really use that here, or I would need an efficient way to know I'm in a trap handler (I suppose I could catch the can't be called from trap context (ThreadError) but that's really ugly and likely quite slow).
Because Timeout doesn't define the trap handler, the user does.
And asking every user to workaround like that doesn't seem good, that actually creates more threads than a single signal-handling thread created by Ruby.

Updated by ko1 (Koichi Sasada) over 2 years ago

Eregon (Benoit Daloze) wrote in #note-8:

Right, like I mentioned above.
But the deadlock might not happen (or even cannot happen for some cases I mentioned above), and then CRuby fails needlessly.
If the deadlock does happen, then we would get this error: deadlock; recursive locking (ThreadError). Which means the program doesn't hang and there is a stacktrace and it shows relevant information which is useful to fix it.

My understandings are:

(1) It is hard to detect such deadlock risk because such errors occur with with very low frequency.
(2) It is hard to expect that the Mutex users care about signal handlers.
(3) It is hard to modify fixing with signal handlers.

So I think current limitation makes sense.

akr had proposed that if someone want to use such case, how about to make a thread in a trap handler for workaround. I think it is reasonable.

I cannot really use that here, or I would need an efficient way to know I'm in a trap handler (I suppose I could catch the can't be called from trap context (ThreadError) but that's really ugly and likely quite slow).
Because Timeout doesn't define the trap handler, the user does.
And asking every user to workaround like that doesn't seem good, that actually creates more threads than a single signal-handling thread created by Ruby.

akr's idea is not about Timeout library, but for Timeout users in trap handers.
If someone wants to use Timeout (or other complex/thread-safety needed code), making a thread like trap(...){Thread.new{ ... }} seems feasible workaround.

Updated by Eregon (Benoit Daloze) over 2 years ago

ko1 (Koichi Sasada) wrote in #note-11:

My understandings are:

(1) It is hard to detect such deadlock risk because such errors occur with with very low frequency.
(2) It is hard to expect that the Mutex users care about signal handlers.
(3) It is hard to modify fixing with signal handlers.

So I think current limitation makes sense.

No, it doesn't make sense, because it actually prevents the proper fix for TIMEOUT_THREAD_MUTEX for Timeout:
https://github.com/ruby/timeout/issues/17#issuecomment-1464039853
So the limitation is unsound (it prevents valid and correct usages) and so we must remove it.

Also it doesn't catch try_lock and Queue#pop, etc, so it's very rough and inaccurate.

We could make it a $VERBOSE = true warning maybe, but again it would be false positives and impossible to address, so it seems not good.

Another idea would be being able to mark which Mutex are "trap-safe", maybe with an argument to Mutex#initialize.
Sort of like how IO does it.

ko1 (Koichi Sasada) wrote in #note-11:

akr's idea is not about Timeout library, but for Timeout users in trap handers.
If someone wants to use Timeout (or other complex/thread-safety needed code), making a thread like trap(...){Thread.new{ ... }} seems feasible workaround.

But users might use some gem or some library code in trap handler and they might not know whether that uses Timeout or not.
And even if they know, it should just work, without needing any workaround from the user.

My conclusion is doing nothing is unacceptable, this CRuby limitation breaks valid and correct code such as https://github.com/ruby/timeout/issues/17#issuecomment-1464039853.
So either:

  1. We remove this unsound limitation, same as on other Rubies
  2. We add an argument to Mutex#initialize to mark it as trap-safe
  3. We execute trap handlers on a separate thread

Which one do we choose?

I think 3 is the best (much easier to reason about), but of course there is some potential for incompatibility there.
1 or 2 seem easy, so we could do them fast and maybe even backport it.

Updated by ko1 (Koichi Sasada) over 2 years ago

No, it doesn't make sense, because it actually prevents the proper fix for TIMEOUT_THREAD_MUTEX for Timeout:

I think this reason violates (1).

Updated by Eregon (Benoit Daloze) over 2 years ago

ko1 (Koichi Sasada) wrote in #note-13:

I think this reason violates (1).

What do you mean?
AFAIK https://github.com/ruby/timeout/issues/17#issuecomment-1464039853 is correct.
If it's incorrect, please point the mistake in my reasoning.

We can't have understanding (1) by using solution 1. (remove the trap+Mutex check), indeed those are exclusive.
IMO it's still worth it because we are always breaking when it's only rarely a problem.
So CRuby is causing hard failures for things that might just work.

Here is a list of your 3 understandings and how the 3 solutions address them, I also add (0):
(0) Fixes soundness, i.e. doesn't fail when the code correctly handles signal handler reentrancy

Solutions:

  1. fixes (0)
  2. fixes (0), (1) because no change there, (2) because opt-in
  3. fixes (0), (1), (2), (3)

Updated by ioquatix (Samuel Williams) 2 months ago

I agree the current behaviour seems too limiting. If user code does something incorrect, it should fail, or deadlock. But we should not prevent valid programs just because some users will write invalid programs.

Updated by ioquatix (Samuel Williams) 20 days ago ยท Edited

Folks, what do you think of https://github.com/ruby/ruby/pull/13545?

In my PR, I noticed that IO's internal write_lock explicitly bypasses this limitation. In other words, it shows that there is a need for this behaviour and we have to implement special cases. I'd prefer we don't have any special cases, so I agree even more strongly with (1) We remove this unsound limitation, same as on other Rubies.

Updated by ioquatix (Samuel Williams) 2 days ago

I agree, it would be nice to relax this restriction. But I also understand that it's unpredictable since the trap handler can run at any point the Ruby interpreter checks for interrupts (which for user code is fairly frequent)

As an alternative (to relaxing the restriction), we could expose the existing logic/flag, e.g. Mutex#safe_in_trap_context = true/false. It's a bit ugly, but it's one idea that might allow us to move forward while retaining the default existing behaviour.

It's extremely hard to reason about a signal handler running on top of any line of code of the main thread.

In Async, we handle this using Thread.handle_interrupt to defer interrupts until a specific known point in the event loop execution: https://github.com/socketry/async/blob/4206e32da4e04b65a4074080f9a59d66b827a347/lib/async/scheduler.rb#L520-L540. This helps to make interrupt handling more predictable. This makes asynchronous interrupts a little bit safer, IMHO.

There seems to be no valid reason to prevent all Mutex#synchronize in trap.

pthread_mutex_ functions are not async-signal-safe. However, Ruby does not run code directly in the signal handler, so it's not the same risk - but I feel like in principle it's still the same semantic issue. So, it's not entirely clear to me that it's safe to use any kind of blocking operation in principle, in trap handlers.

Updated by ioquatix (Samuel Williams) 2 days ago

As one other idea, what about introducing Mutex#reentrant = true/false. This seems like a more general model than safe_in_trap_context and I think we could then allow reentrant mutex to be used safely in trap contexts. The user would be explicitly opting into that scenario.

Updated by Eregon (Benoit Daloze) 2 days ago

I think a simple way to look at this issue is to analyze all cases:

  1. The Mutex is not held by any Thread, and the signal handler acquires it (the common case): this is fully correct and would work just fine without the unnecessary can't be called from trap context (ThreadError) from CRuby which breaks this.
  2. The Mutex is held by the main Thread, and the signal handler tries to acquire it: it would be deadlock; recursive locking (ThreadError) which is as good as the current can't be called from trap context (ThreadError).
  3. The Mutex is held by another Thread, and the signal handler tries to acquire it: the signal handler will wait until that Thread releases the Mutex. This is totally normal, it is Mutex semantics. You can't argue the Thread might keep the Mutex forever because that would be a bug in the first place. In the worst case it hangs, the user would Ctrl+C and see the stacktrace, which is a reasonable outcome for that case. CRuby is even sometimes able to detect such a deadlock and raises a fatal error in that case.

So removing the check is clearly better for the 1st case, is equivalent for the 2nd case, and is better for the 3rd case (at least for the subcase where the other Thread doesn't hold the Mutex forever).
If anyone disagrees, can they explain how it is better for all of these cases to fail with can't be called from trap context (ThreadError)?

Actions

Also available in: Atom PDF

Like1
Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0