Bug #15430

test_fork_while_parent_locked is failing on Ruby CI

Added by hsbt (Hiroshi SHIBATA) about 6 years ago. Updated almost 6 years ago.

Status: Closed
Target version: -
[ruby-core:90595]

Updated by normalperson (Eric Wong) about 6 years ago

Thanks. Maybe related to [Bug #15424]. Will investigate in coming days.

Updated by normalperson (Eric Wong) about 6 years ago

  • Status changed from Assigned to Closed

Applied in changeset trunk|r66438.


thread_sync.c (mutex_ptr): only reinitialize waitqueue at fork

Mutexes need to remain locked after forking.

This fixes "[BUG] invalid keeping_mutexes: Attempt to unlock a
mutex which is locked by another thread" and should
fix test_fork_while_parent_locked failures in CI

[ruby-core:90581] [Bug #15424]
[ruby-core:90595] [Bug #15430]

Fixes: r66230 ("handle mutexes held by parent threads in children")
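
For illustration, a minimal Ruby sketch of the invariant this commit restores (assuming a platform with fork; this is not the CI test itself): a Mutex held by the thread that calls fork must still be locked, and still owned by the surviving thread, in the child.

  # Minimal sketch, not the actual test: the forking thread owns m,
  # so the child's copy of m must still be locked and unlockable.
  m = Thread::Mutex.new
  m.lock
  pid = fork do
    exit!(1) unless m.locked?   # ownership must survive the fork
    m.unlock                    # legal: the surviving thread is the owner
    exit!(0)
  end
  _, status = Process.waitpid2(pid)
  m.unlock
  warn "invariant violated" unless status.success?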

Updated by normalperson (Eric Wong) about 6 years ago

r66438 should fix it. Ugh, I clearly shouldn't be allowed near
a keyboard anymore :<

Updated by normalperson (Eric Wong) about 6 years ago

It still exists. https://rubyci.org/logs/rubyci.s3.amazonaws.com/centos7/ruby-trunk/log/20181218T230003Z.fail.html.gz

:< I am now thinking r66230 (the fix for [Bug #15383]) is
unnecessary and that [Bug #15383] is a bug somewhere else.

But I can't reproduce it atm...

Updated by normalperson (Eric Wong) about 6 years ago

> But I can't reproduce it atm...

OK, I need a break and some sleep (but I realize the 2.6 release
is very soon).

I can reproduce [Bug #15383] with the "rb_mutex_t.fork_gen" field
removed, but fork_gen shouldn't be necessary in the mutex because
the rb_mutex_abandon_* functions and rb_mutex_cleanup_keeping_mutexes
already call list_head_init(waitq).

So there is a bug in one of the rb_mutex_{abandon,cleanup}*
functions which is hidden by the addition of rb_mutex_t.fork_gen.

Updated by normalperson (Eric Wong) about 6 years ago

  • Status changed from Assigned to Closed

Applied in changeset trunk|r66477.


thread_sync.c (rb_mutex_cleanup_keeping_mutexes): update fork_gen

... when clearing waitq. Otherwise, we risk redundantly clearing
valid waiters in future calls to mutex_ptr.

Note: I am not sure if this fixes [Bug #15430], and even if it
did, fork_gen is a belt-and-suspenders redundancy for [Bug #15383]
which wastes one word for every Mutex object.

Updated by normalperson (Eric Wong) about 6 years ago

Fwiw, I think the bug is in rb_mutex_abandon_locking_mutex:
it needs to clear the waitq unconditionally, but not touch
mutex->th or mutex->next_mutex:

https://80x24.org/spew/20181221223422.9219-1-e@80x24.org/raw

And as a result, rb_mutex_cleanup_keeping_mutexes is redundant.
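
A hedged Ruby sketch of the scenario being described (and of the condition the later test rewrite in r66508 targets), assuming a platform with fork: threads queued on the mutex in the parent do not exist in the child, so the child's copy of the wait queue must be dropped while ownership of the mutex is kept. In the broken builds, the child's unlock could touch those stale waiters and trigger the old [Bug #15383] segfault.

  # Sketch only: exercises the fork-while-parent-locked condition.
  m = Thread::Mutex.new
  m.lock
  waiters = 2.times.map { Thread.new { m.synchronize {} } }
  Thread.pass until waiters.all?(&:stop?)  # both threads are queued on m
  pid = fork do
    exit!(1) unless m.locked?  # the surviving thread still owns m
    m.unlock                   # must not try to wake the parent's waiters
    m.lock                     # the mutex must remain usable in the child
    exit!(0)
  end
  _, status = Process.waitpid2(pid)
  m.unlock
  waiters.each(&:join)
  warn "child crashed or failed" unless status.success?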

Updated by normalperson (Eric Wong) about 6 years ago

  • Status changed from Assigned to Closed

Applied in changeset trunk|r66494.


test/ruby/test_thread.rb (test_fork_while_parent_locked): reduce threads

Reduce thread counts unconditionally for some CI systems with
low limits. And Solaris apparently lacks RLIMIT_NPROC, so we
can't detect resource limits and scale the test appropriately.

[ruby-core:90670] [Bug #15430]
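
For context, a hypothetical reconstruction (not the code that was removed) of why scaling the thread count by RLIMIT_NPROC is not portable: as the ArgumentError quoted below shows, Process.getrlimit(:NPROC) raises on platforms such as Solaris that lack the resource, so only an unconditional default works everywhere.

  # Hypothetical sketch: scale a thread count by RLIMIT_NPROC where the
  # limit exists, otherwise fall back to an unconditional default
  # (Solaris raises ArgumentError: invalid resource name: NPROC).
  nr = 30  # illustrative default, not the value used by the test
  begin
    soft, = Process.getrlimit(:NPROC)
    nr = [nr, soft / 2].min if soft != Process::RLIM_INFINITY
  rescue ArgumentError
    # no RLIMIT_NPROC on this platform: keep the default
  end
  puts "would spawn up to #{nr} threads"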

Updated by normalperson (Eric Wong) about 6 years ago

wrote:

> As of r66490, the test on some CIs is still failing for different reasons.
>
> Expected "#<Process::Status: pid 515 SIGKILL (signal 9)>" to be empty.

OK, I lowered thread counts in r66494.

Btw, can we get dmesg or kernel logs from those CI machines to
confirm it was the OOM killer? I guess RLIMIT_NPROC was a bad check
anyway, and the OOM killer is a little unpredictable...

> ArgumentError: invalid resource name: NPROC

Interesting; I haven't touched Solaris in ages, but I never would
have thought it lacked RLIMIT_NPROC. So I removed the check
unconditionally.

Updated by k0kubun (Takashi Kokubun) about 6 years ago

  • Status changed from Closed to Assigned

As of r66497 or r66498, some Linux environments are getting:

  1) Failure:
TestThread#test_fork_while_parent_locked [/home/hsbt/chkbuild/tmp/build/20181222T213003Z/ruby/test/ruby/test_thread.rb:1278]:
lower `nr' if SIGKILL because of RLIMIT_NPROC limit.
Expected "#<Process::Status: pid 17656 SIGKILL (signal 9)>" to be empty.

https://rubyci.org/logs/rubyci.s3.amazonaws.com/debian8/ruby-trunk/log/20181222T213003Z.fail.html.gz
https://rubyci.org/logs/rubyci.s3.amazonaws.com/centos7/ruby-trunk/log/20181222T230003Z.fail.html.gz
https://rubyci.org/logs/rubyci.s3.amazonaws.com/arch/ruby-trunk/log/20181222T210001Z.fail.html.gz

So reopening this. If it's just a bad check, please consider eliminating the check or skipping the test itself to avoid false-positive alarms.

> Btw, can we get dmesg or kernel logs from those CI machines to confirm it was the OOM killer?

I don't have access to most of the RubyCI servers. Please contact @hsbt (Hiroshi SHIBATA), @naruse (Yui NARUSE), or possibly @akr (Akira Tanaka) about those servers.

Updated by normalperson (Eric Wong) almost 6 years ago

  • Status changed from Assigned to Closed

Applied in changeset trunk|r66508.


test/ruby/test_thread.rb (test_fork_while_parent_locked): rewrite to avoid OOM

Instead of using a torture test, trigger the condition for the old
segfault in [Bug #15383] exactly.

[ruby-core:90676] [Bug #15430]

Updated by normalperson (Eric Wong) almost 6 years ago

> So reopening this. If it's just a bad check, please consider
> eliminating the check or skipping the test itself to avoid
> false-positive alarms.

OK, I rewrote the test in r66508 to surgically trigger the
condition. Feel free to delete the test if it still fails; I don't
think I'll be around much more.

Updated by k0kubun (Takashi Kokubun) almost 6 years ago

It seems to be all green now. Thank you for your help.
