Bug #15430: test_fork_while_parent_locked is failing status on Ruby CI - Ruby - Ruby Issue Tracking System

This fixes "[BUG] invalid keeping_mutexes: Attempt to unlock a
mutex which is locked by another thread" and should
fix test_fork_while_parent_locked failures in CI

[ruby-core:90581] [Bug #15424]
[ruby-core:90595] [Bug #15430]

Fixes: r66230 ("handle mutexes held by parent threads in children")

Actions

Copy link

#3 [ruby-core:90601]

Updated by normalperson (Eric Wong) over 6 years ago

r66438 should fix it. Ugh, I clearly shouldn't be allowed near
a keyboard anymore :<

Actions

Copy link

#4 [ruby-core:90614]

Updated by hsbt (Hiroshi SHIBATA) over 6 years ago

Status changed from Closed to Assigned

@normalperson (Eric Wong)

It still exists. https://rubyci.org/logs/rubyci.s3.amazonaws.com/centos7/ruby-trunk/log/20181218T230003Z.fail.html.gz

Actions

Copy link

#5 [ruby-core:90630]

Updated by normalperson (Eric Wong) over 6 years ago

It still exists. https://rubyci.org/logs/rubyci.s3.amazonaws.com/centos7/ruby-trunk/log/20181218T230003Z.fail.html.gz

:< I am now thinking the r66230 for [Bug #15383] is
unnecessary and [Bug #15383] is a bug somewhere else.

But I cant reproduce it atm...

Actions

Copy link

#6 [ruby-core:90634]

Updated by normalperson (Eric Wong) over 6 years ago

But I cant reproduce it atm...

OK, I need a break and get some sleep (but I realize 2.6 release is
is very soon).

I can reproduce [Bug #15383] with "rb_mutex_t.fork_gen" field
removed, but fork_gen shouldn't be necessary in mutex because of
rb_mutex_abandon_* functions and rb_mutex_cleanup_keeping_mutexes
already call list_head_init(waitq).

So there is a bug in one of the rb_mutex_{abandon,cleanup}*
functions which is hidden by the addition of rb_mutex_t.fork_gen.

Actions

Copy link

Updated by normalperson (Eric Wong) over 6 years ago

Status changed from Assigned to Closed

Applied in changeset trunk|r66477.

thread_sync.c (rb_mutex_cleanup_keeping_mutexes): update fork_gen

... when clearing waitq. Otherwise, we risk redundantly clearing
valid waiters in future calls to mutex_ptr.

Note: I am not sure if this fixes [Bug #15430], and even if it
did, fork_gen is a belt-and-suspenders redundancy for [Bug #15383]
which wastes one word for every Mutex object.

Actions

Copy link

#8 [ruby-core:90669]

Updated by normalperson (Eric Wong) over 6 years ago

Fwiw, I think the bug is in rb_mutex_abandon_locking_mutex,
it needs to clear the waitq unconditionally; but not touch
mutex->th or mutex->next_mutex:

https://80x24.org/spew/20181221223422.9219-1-e@80x24.org/raw

And as a result, rb_mutex_cleanup_keeping_mutexes is redundant.

Actions

Copy link

#9 [ruby-core:90670]

Updated by k0kubun (Takashi Kokubun) over 6 years ago

Status changed from Closed to Assigned

As of r66490, the test on some CIs is still failing for different reasons.

Expected "#<Process::Status: pid 515 SIGKILL (signal 9)>" to be empty.

https://rubyci.org/logs/rubyci.s3.amazonaws.com/debian8/ruby-trunk/log/20181222T063002Z.fail.html.gz
https://rubyci.org/logs/rubyci.s3.amazonaws.com/gentoo/ruby-trunk/log/20181222T063003Z.fail.html.gz
https://rubyci.org/logs/rubyci.s3.amazonaws.com/centos7/ruby-trunk/log/20181222T050003Z.fail.html.gz
https://rubyci.org/logs/rubyci.s3.amazonaws.com/fedora27/ruby-trunk/log/20181222T063002Z.fail.html.gz
https://rubyci.org/logs/rubyci.s3.amazonaws.com/arch/ruby-trunk/log/20181222T060002Z.fail.html.gz

ArgumentError: invalid resource name: NPROC

https://rubyci.org/logs/rubyci.s3.amazonaws.com/unstable10x/ruby-trunk/log/20181222T051806Z.fail.html.gz
https://rubyci.org/logs/rubyci.s3.amazonaws.com/unstable11x/ruby-trunk/log/20181222T052407Z.fail.html.gz
https://rubyci.org/logs/rubyci.s3.amazonaws.com/unstable10s/ruby-trunk/log/20181222T001911Z.fail.html.gz
https://rubyci.org/logs/rubyci.s3.amazonaws.com/unstable11s/ruby-trunk/log/20181222T052505Z.fail.html.gz

Could you take a look at it again?

Actions

Copy link

#10

Updated by normalperson (Eric Wong) over 6 years ago

Status changed from Assigned to Closed

Applied in changeset trunk|r66494.

test/ruby/test_thread.rb (test_fork_while_parent_locked): reduce threads

Reduce thread counts unconditionally for some CI systems with
low limits.. And Solaris apparently lacks RLIMIT_NPROC, so we
can't detect resource limits and scale the test appropriately.

[ruby-core:90670] [Bug #15430]

Actions

Copy link

#11 [ruby-core:90671]

Updated by normalperson (Eric Wong) over 6 years ago

takashikkbn@gmail.com wrote:

As of r66490, the test on some CIs is still failing for different reasons.

Expected "#<Process::Status: pid 515 SIGKILL (signal 9)>" to be empty.

OK, I lowered thread counts in r66494.

Btw, can we get dmesg or kernel logs from those CI machines to
confirm it was OOM killer? I guess RLIMIT_NPROC was a bad check
anyways; and OOM killer is a little unpredictable...

ArgumentError: invalid resource name: NPROC

Interesting; I haven't touched Solaris in ages but I never
thought it lacked RLIMIT_NPROC. So I unconditionally removed
the check.

Actions

Copy link

#12 [ruby-core:90676]

Updated by k0kubun (Takashi Kokubun) over 6 years ago

Status changed from Closed to Assigned

As of r66497 or r66498, some Linux environments are getting:

  1) Failure:
TestThread#test_fork_while_parent_locked [/home/hsbt/chkbuild/tmp/build/20181222T213003Z/ruby/test/ruby/test_thread.rb:1278]:
lower `nr' if SIGKILL because of RLIMIT_NPROC limit.
Expected "#<Process::Status: pid 17656 SIGKILL (signal 9)>" to be empty.

https://rubyci.org/logs/rubyci.s3.amazonaws.com/debian8/ruby-trunk/log/20181222T213003Z.fail.html.gz
https://rubyci.org/logs/rubyci.s3.amazonaws.com/centos7/ruby-trunk/log/20181222T230003Z.fail.html.gz
https://rubyci.org/logs/rubyci.s3.amazonaws.com/arch/ruby-trunk/log/20181222T210001Z.fail.html.gz

So reopening this. If it's just a bad check, please consider eliminating the check or skipping the test itself to avoid false-positive alarms.

Btw, can we get dmesg or kernel logs from those CI machines to confirm it was OOM killer?

I don't have the access to most of RubyCI servers. Please contact @hsbt (Hiroshi SHIBATA), @naruse (Yui NARUSE), or possibly @akr (Akira Tanaka) about those servers.

Actions

Copy link

#13

Updated by normalperson (Eric Wong) over 6 years ago

Status changed from Assigned to Closed

Applied in changeset trunk|r66508.

test/ruby/test_thread.rb (test_fork_while_parent_locked): rewrite to avoid OOM

Instead of using a torture test, trigger the condition for the old
segfault in [Bug #15383] exactly.

[ruby-core:90676] [Bug #15430]

Actions

Copy link

#14 [ruby-core:90680]

Updated by normalperson (Eric Wong) over 6 years ago

So reopening this. If it's just a bad check, please consider
eliminating the check or skipping the test itself to avoid
false-positive alarms.

OK, I rewrote the test to surgically trigger the condition
r66508. Feel free to delete the test if it still fails; I don't
think I'll be around much more.

Actions

Copy link

#15 [ruby-core:90682]

Updated by k0kubun (Takashi Kokubun) over 6 years ago

It seems to be all green now. Thank you for your help.

Actions

Copy link

Also available in: Atom PDF

Like0

Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0

Project

General

Profile

Ruby

Tags

Custom queries

Bug #15430

test_fork_while_parent_locked is failing status on Ruby CI

Updated by normalperson (Eric Wong) over 6 years ago

Updated by normalperson (Eric Wong) over 6 years ago

Updated by normalperson (Eric Wong) over 6 years ago

Updated by hsbt (Hiroshi SHIBATA) over 6 years ago

Updated by normalperson (Eric Wong) over 6 years ago

Updated by normalperson (Eric Wong) over 6 years ago

Updated by normalperson (Eric Wong) over 6 years ago

Updated by normalperson (Eric Wong) over 6 years ago

Updated by k0kubun (Takashi Kokubun) over 6 years ago

Updated by normalperson (Eric Wong) over 6 years ago

Updated by normalperson (Eric Wong) over 6 years ago

Updated by k0kubun (Takashi Kokubun) over 6 years ago

Updated by normalperson (Eric Wong) over 6 years ago

Updated by normalperson (Eric Wong) over 6 years ago

Updated by k0kubun (Takashi Kokubun) over 6 years ago