Bug #21926
closedThread#value on popen3 wait thread hangs in finalizer
Description
Calling Thread#value on an Open3.popen3 wait thread from a finalizer completes in Ruby 3.2 but hangs in Ruby 3.3+. See repro.rb below. When the Ruby process hangs in these conditions, it no longer responds to signals and it seems to be unable to run any other threads.
This affects the schmooze gem (and potentially other code using Open3.popen3 with finalizers), causing test suites to hang intermittently.
# repro.rb
require 'open3'
class ProcessWrapper
def initialize
@stdin, @stdout, @stderr, @wait_thread = Open3.popen3("cat")
ObjectSpace.define_finalizer(self, self.class.make_finalizer(@stdin, @stdout, @stderr, @wait_thread))
end
def self.make_finalizer(stdin, stdout, stderr, wait_thread)
proc do
stdin.close rescue nil
stdout.close rescue nil
stderr.close rescue nil
wait_thread.value # Hangs here in Ruby 3.3+
end
end
end
100.times { ProcessWrapper.new }
GC.stress = true
1000.times { Object.new }
puts "done"
Environment¶
- Linux x86_64
- Tested on Ruby 3.2.7, 3.3.7, 3.4.8
Updated by luke-gru (Luke Gruber) 14 days ago
ยท Edited
- Backport changed from 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN, 4.0: UNKNOWN to 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: REQUIRED, 4.0: REQUIRED
Thank you for the nice reproduction! We have a fix coming, and it will probably be backported to 3.3, 3.4 and 4.0.
The fix is at https://github.com/ruby/ruby/pull/16307.
Updated by luke-gru (Luke Gruber) 14 days ago
- Assignee set to luke-gru (Luke Gruber)
Updated by luke-gru (Luke Gruber) 10 days ago
- Backport changed from 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: REQUIRED, 4.0: REQUIRED to 3.2: UNKNOWN, 3.3: REQUIRED, 3.4: REQUIRED, 4.0: REQUIRED
Updated by Anonymous 10 days ago
- Status changed from Open to Closed
Applied in changeset git|08372635f7ec09f7115bd254246ebd637499651c.
Fix race condition right after ubf registration
Registering a ubf was considered problematic in some cases because it could
result in lock ordering inversions with the ubf function itself. I believe
this is the reason that in patch be1bbd5b7d, the ubf was registered outside of
the thread_sched_lock.
For example, thread_sched_to_waiting_until_wakeup() (native_sleep)
should register the ubf with the thread_sched_lock (TSL) taken. The ordering
is (TSL -> UBF lock). During the ubf call itself, since the UBF lock is
always acquired during the call, the ordering is (UBF lock -> TSL). This
inversion was avoided by registering the ubf outside the TSL.
This inversion is benign in this case, though. Since there can be no ubf on the
thread before the call to ubf_set, and this ubf is the only place where the
(UBF lock -> TSL lock) ordering is done, it is okay to register it with the TSL.
In fact registering the ubf outside the TSL can result in problamatic race conditions,
so we now register the ubf with the TSL lock.
This fixes a race condition whereby a newly registered ubf is called before the thing it's
protecting (the sleep) is actually run. The sleeping thread can never be interrupted if this
happens.
Fixes [Bug #21926]
Updated by luke-gru (Luke Gruber) 9 days ago
A note to whoever backports this (which could be me, I'm just waiting ~ 1 week and checking various CIs after the merge):
https://github.com/ruby/ruby/pull/16362 should be backported as well, they are really 1 fix in 2 separate commits.
Updated by k0kubun (Takashi Kokubun) 3 days ago
- Backport changed from 3.2: UNKNOWN, 3.3: REQUIRED, 3.4: REQUIRED, 4.0: REQUIRED to 3.2: UNKNOWN, 3.3: REQUIRED, 3.4: REQUIRED, 4.0: DONE
ruby_4_0 a601b899a35c796775309dca01a6d5e64be14c44 merged revision(s) 08372635f7ec09f7115bd254246ebd637499651c.