Project

General

Profile

Bug #15383

Reproducible crash: crash.rb:6: [BUG] unexpected THREAD_KILLED

Added by mbjs (Markus Schirp) about 1 year ago. Updated 12 months ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Target version:
-
ruby -v:
ruby 2.5.3p105 (2018-10-18 revision 65156) [x86_64-linux]
[ruby-core:90312]

Description

Hi,

I'm reporting a reliable crash of the ruby interpreter on contested mutexeses that are accessed in child processes.

I currently think that this happens as the child processes main thread, may be waiting for a parent process sibling thread that was holding the mutex at the time of the fork. After the fork is done, all sibling threads are dead, and the mutex detects the attempt to wait for a dead thread, bailing out.

This is simular, but not identical to the case here: https://bugs.ruby-lang.org/issues/14578

Here is a gist with some more test results on various platforms: https://gist.github.com/mbj/e6795ee5e0583c5541ee250e9942279a

I'm fine to get my hands dirty, but would need some pointers if my above conclusion points to the right direction.

Best,

Markus


Files

output.txt (21.3 KB) output.txt Crash log mbjs (Markus Schirp), 12/05/2018 01:33 PM
crash.rb (200 Bytes) crash.rb Reproduction script (super small) mbjs (Markus Schirp), 12/05/2018 01:33 PM

Related issues

Related to Ruby master - Bug #14634: Queue#push seems to crash after forkClosedActions

Associated revisions

Revision 818f1c65
Added by normal about 1 year ago

thread_sync.c (mutex_ptr): handle mutexes held by parent threads in children

Mutexes may be held by threads which only exist in the parent
process, so their waitqueues may be populated with references
to other dead threads. We must reset them at fork.

I am a moron for introducing this bug :<

[ruby-core:90312] [Bug #15383]

git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@66230 b2dd03c8-39d4-4d8f-98ff-823fe69b080e

Revision 66230
Added by normalperson (Eric Wong) about 1 year ago

thread_sync.c (mutex_ptr): handle mutexes held by parent threads in children

Mutexes may be held by threads which only exist in the parent
process, so their waitqueues may be populated with references
to other dead threads. We must reset them at fork.

I am a moron for introducing this bug :<

[ruby-core:90312] [Bug #15383]

Revision 66230
Added by normal about 1 year ago

thread_sync.c (mutex_ptr): handle mutexes held by parent threads in children

Mutexes may be held by threads which only exist in the parent
process, so their waitqueues may be populated with references
to other dead threads. We must reset them at fork.

I am a moron for introducing this bug :<

[ruby-core:90312] [Bug #15383]

Revision fa5601e7
Added by normal 12 months ago

thread_sync.c (rb_mutex_cleanup_keeping_mutexes): update fork_gen

... when clearing waitq. Otherwise, we risk redundantly clearing
valid waiters in future calls to mutex_ptr.

Note: I am not sure if this fixes [Bug #15430], and even if it
did, fork_gen is a belt-and-suspenders redundancy for [Bug #15383]
which wastes one word for every Mutex object.

git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@66477 b2dd03c8-39d4-4d8f-98ff-823fe69b080e

Revision 66477
Added by normalperson (Eric Wong) 12 months ago

thread_sync.c (rb_mutex_cleanup_keeping_mutexes): update fork_gen

... when clearing waitq. Otherwise, we risk redundantly clearing
valid waiters in future calls to mutex_ptr.

Note: I am not sure if this fixes [Bug #15430], and even if it
did, fork_gen is a belt-and-suspenders redundancy for [Bug #15383]
which wastes one word for every Mutex object.

Revision 66477
Added by normal 12 months ago

thread_sync.c (rb_mutex_cleanup_keeping_mutexes): update fork_gen

... when clearing waitq. Otherwise, we risk redundantly clearing
valid waiters in future calls to mutex_ptr.

Note: I am not sure if this fixes [Bug #15430], and even if it
did, fork_gen is a belt-and-suspenders redundancy for [Bug #15383]
which wastes one word for every Mutex object.

Revision 0fd53f51
Added by normal 12 months ago

thread_sync.c (rb_mutex_t): eliminate fork_gen

The true bug fork_gen was hiding was rb_mutex_abandon_locking_mutex
failing to unconditionally clear the waitq of mutexes it was
waiting on. So we fix rb_mutex_abandon_locking_mutex, instead,
and eliminate rb_mutex_cleanup_keeping_mutexes.

This commit was tested heavily on a single-core Pentium-M which
was my most reliable reproducer of the "crash.rb" script from
[Bug #15383]

[Bug #14578] [Bug #15383]

Note: [Bug #15430] turned out to be an entirely different
problem: RLIMIT_NPROC limit was hit on the CI VMs.

git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@66489 b2dd03c8-39d4-4d8f-98ff-823fe69b080e

Revision 66489
Added by normalperson (Eric Wong) 12 months ago

thread_sync.c (rb_mutex_t): eliminate fork_gen

The true bug fork_gen was hiding was rb_mutex_abandon_locking_mutex
failing to unconditionally clear the waitq of mutexes it was
waiting on. So we fix rb_mutex_abandon_locking_mutex, instead,
and eliminate rb_mutex_cleanup_keeping_mutexes.

This commit was tested heavily on a single-core Pentium-M which
was my most reliable reproducer of the "crash.rb" script from
[Bug #15383]

[Bug #14578] [Bug #15383]

Note: [Bug #15430] turned out to be an entirely different
problem: RLIMIT_NPROC limit was hit on the CI VMs.

Revision 66489
Added by normal 12 months ago

thread_sync.c (rb_mutex_t): eliminate fork_gen

The true bug fork_gen was hiding was rb_mutex_abandon_locking_mutex
failing to unconditionally clear the waitq of mutexes it was
waiting on. So we fix rb_mutex_abandon_locking_mutex, instead,
and eliminate rb_mutex_cleanup_keeping_mutexes.

This commit was tested heavily on a single-core Pentium-M which
was my most reliable reproducer of the "crash.rb" script from
[Bug #15383]

[Bug #14578] [Bug #15383]

Note: [Bug #15430] turned out to be an entirely different
problem: RLIMIT_NPROC limit was hit on the CI VMs.

Revision 37bba27b
Added by normal 12 months ago

test/ruby/test_thread.rb (test_fork_while_parent_locked): rewrite to avoid OOM

Instead of using a torture test, trigger the condition for the old
segfault in [Bug #15383] exactly.

[ruby-core:90676] [Bug #15430]

git-svn-id: svn+ssh://ci.ruby-lang.org/ruby/trunk@66508 b2dd03c8-39d4-4d8f-98ff-823fe69b080e

Revision 66508
Added by normalperson (Eric Wong) 12 months ago

test/ruby/test_thread.rb (test_fork_while_parent_locked): rewrite to avoid OOM

Instead of using a torture test, trigger the condition for the old
segfault in [Bug #15383] exactly.

[ruby-core:90676] [Bug #15430]

Revision 66508
Added by normal 12 months ago

test/ruby/test_thread.rb (test_fork_while_parent_locked): rewrite to avoid OOM

Instead of using a torture test, trigger the condition for the old
segfault in [Bug #15383] exactly.

[ruby-core:90676] [Bug #15430]

History

#1

Updated by normalperson (Eric Wong) about 1 year ago

  • Backport changed from 2.4: UNKNOWN, 2.5: UNKNOWN to 2.4: UNKNOWN, 2.5: REQUIRED
#2

Updated by normalperson (Eric Wong) about 1 year ago

  • Status changed from Open to Closed

Applied in changeset trunk|r66230.


thread_sync.c (mutex_ptr): handle mutexes held by parent threads in children

Mutexes may be held by threads which only exist in the parent
process, so their waitqueues may be populated with references
to other dead threads. We must reset them at fork.

I am a moron for introducing this bug :<

[ruby-core:90312] [Bug #15383]

Updated by normalperson (Eric Wong) about 1 year ago

https://bugs.ruby-lang.org/issues/15383

Thanks, it affects trunk; just more difficult to reproduce
because of thread cache.

I'm a moron for not noticing this when I fixed other bugs :<

r66230 should fix it in trunk and should be backported
(but r66229 is independently broken and I just reverted it for now)

#4

Updated by mbjs (Markus Schirp) about 1 year ago

  • Subject changed from Reproducible crash: crash.sh:6: [BUG] unexpected THREAD_KILLED to Reproducible crash: crash.rb:6: [BUG] unexpected THREAD_KILLED

Updated by mbjs (Markus Schirp) about 1 year ago

Thanks for the quick fix. Also for marking the fix to be backported.

Just curious, is there an associated CI build for these changes?

Updated by normalperson (Eric Wong) about 1 year ago

mbj@schirp-dso.com wrote:

Just curious, is there an associated CI build for these changes?

I check https://rubyci.org/ and http://ci.rvm.jp/ (and get
automated mails from the latter).

There's also TravisCI; but I don't use JavaScript; so I rely
on others giving me URLs to the raw logs.

#7

Updated by nagachika (Tomoyuki Chikanaga) about 1 year ago

  • Related to Bug #14634: Queue#push seems to crash after fork added

Updated by normalperson (Eric Wong) 12 months ago

https://bugs.ruby-lang.org/issues/15383

r66230 should fix it in trunk and should be backported

No, actually. r66230 hides an existing problem in the fix
for https://bugs.ruby-lang.org/issues/14578
...
Still working on this and my head hurts :<

Also available in: Atom PDF