Bug #15430: test_fork_while_parent_locked is failing status on Ruby CI - Ruby - Ruby Issue Tracking System

Custom queries

Backport 3.2
Backport 3.3
Backport 3.4
Backport 4.0
bugs: unassigned
DevMeeting
matz
Open issues with attachment
Windows

Actions

Copy link

Bug #15430

closed

test_fork_while_parent_locked is failing status on Ruby CI

Bug #15430: test_fork_while_parent_locked is failing status on Ruby CI

Added by hsbt (Hiroshi SHIBATA) about 7 years ago. Updated about 7 years ago.

Status:

Closed

Assignee:

normalperson (Eric Wong)

Target version:

ruby -v:

Backport:

2.4: UNKNOWN, 2.5: UNKNOWN

[ruby-core:90595]

Description

After r66230, TestThread#test_fork_while_parent_locked is failing status on Ruby CI.

See

CentOS 7 https://rubyci.org/logs/rubyci.s3.amazonaws.com/centos7/ruby-trunk/log/20181217T230002Z.fail.html.gz
Amazon Lunux 2 https://rubyci.org/logs/rubyci.s3.amazonaws.com/amazon2/ruby-trunk/log/20181218T003003Z.fail.html.gz
Arch Linux https://rubyci.org/logs/rubyci.s3.amazonaws.com/arch/ruby-trunk/log/20181218T000002Z.fail.html.gz

History
Notes
Property changes
Associated revisions

Updated by normalperson (Eric Wong) about 7 years ago Actions
Copy link
#1 [ruby-core:90598]

Thanks. Maybe related to [Bug #15430]. Will investigate in coming days.

Updated by normalperson (Eric Wong) about 7 years ago Actions
Copy link
#2

Status changed from Assigned to Closed

Applied in changeset trunk|r66438.

thread_sync.c (mutex_ptr): only reinitalize waitqueue at fork

Mutexes need to remain locked after forking.

This fixes "[BUG] invalid keeping_mutexes: Attempt to unlock a
mutex which is locked by another thread" and should
fix test_fork_while_parent_locked failures in CI

[ruby-core:90581] [Bug #15424]
[ruby-core:90595] [Bug #15430]

Fixes: r66230 ("handle mutexes held by parent threads in children")

Updated by normalperson (Eric Wong) about 7 years ago Actions
Copy link
#3 [ruby-core:90601]

r66438 should fix it. Ugh, I clearly shouldn't be allowed near
a keyboard anymore :<

Updated by hsbt (Hiroshi SHIBATA) about 7 years ago Actions
Copy link
#4 [ruby-core:90614]

Status changed from Closed to Assigned

@normalperson (Eric Wong)

It still exists. https://rubyci.org/logs/rubyci.s3.amazonaws.com/centos7/ruby-trunk/log/20181218T230003Z.fail.html.gz

Updated by normalperson (Eric Wong) about 7 years ago Actions
Copy link
#5 [ruby-core:90630]

It still exists. https://rubyci.org/logs/rubyci.s3.amazonaws.com/centos7/ruby-trunk/log/20181218T230003Z.fail.html.gz

:< I am now thinking the r66230 for [Bug #15383] is
unnecessary and [Bug #15383] is a bug somewhere else.

But I cant reproduce it atm...

Updated by normalperson (Eric Wong) about 7 years ago Actions
Copy link
#6 [ruby-core:90634]

But I cant reproduce it atm...

OK, I need a break and get some sleep (but I realize 2.6 release is
is very soon).

I can reproduce [Bug #15383] with "rb_mutex_t.fork_gen" field
removed, but fork_gen shouldn't be necessary in mutex because of
rb_mutex_abandon_* functions and rb_mutex_cleanup_keeping_mutexes
already call list_head_init(waitq).

So there is a bug in one of the rb_mutex_{abandon,cleanup}*
functions which is hidden by the addition of rb_mutex_t.fork_gen.

Updated by normalperson (Eric Wong) about 7 years ago Actions
Copy link
#7

Status changed from Assigned to Closed

Applied in changeset trunk|r66477.

thread_sync.c (rb_mutex_cleanup_keeping_mutexes): update fork_gen

... when clearing waitq. Otherwise, we risk redundantly clearing
valid waiters in future calls to mutex_ptr.

Note: I am not sure if this fixes [Bug #15430], and even if it
did, fork_gen is a belt-and-suspenders redundancy for [Bug #15383]
which wastes one word for every Mutex object.

Updated by normalperson (Eric Wong) about 7 years ago Actions
Copy link
#8 [ruby-core:90669]

Fwiw, I think the bug is in rb_mutex_abandon_locking_mutex,
it needs to clear the waitq unconditionally; but not touch
mutex->th or mutex->next_mutex:

https://80x24.org/spew/20181221223422.9219-1-e@80x24.org/raw

And as a result, rb_mutex_cleanup_keeping_mutexes is redundant.

Updated by k0kubun (Takashi Kokubun) about 7 years ago Actions
Copy link
#9 [ruby-core:90670]

Status changed from Closed to Assigned

As of r66490, the test on some CIs is still failing for different reasons.

Expected "#<Process::Status: pid 515 SIGKILL (signal 9)>" to be empty.

https://rubyci.org/logs/rubyci.s3.amazonaws.com/debian8/ruby-trunk/log/20181222T063002Z.fail.html.gz
https://rubyci.org/logs/rubyci.s3.amazonaws.com/gentoo/ruby-trunk/log/20181222T063003Z.fail.html.gz
https://rubyci.org/logs/rubyci.s3.amazonaws.com/centos7/ruby-trunk/log/20181222T050003Z.fail.html.gz
https://rubyci.org/logs/rubyci.s3.amazonaws.com/fedora27/ruby-trunk/log/20181222T063002Z.fail.html.gz
https://rubyci.org/logs/rubyci.s3.amazonaws.com/arch/ruby-trunk/log/20181222T060002Z.fail.html.gz

ArgumentError: invalid resource name: NPROC

https://rubyci.org/logs/rubyci.s3.amazonaws.com/unstable10x/ruby-trunk/log/20181222T051806Z.fail.html.gz
https://rubyci.org/logs/rubyci.s3.amazonaws.com/unstable11x/ruby-trunk/log/20181222T052407Z.fail.html.gz
https://rubyci.org/logs/rubyci.s3.amazonaws.com/unstable10s/ruby-trunk/log/20181222T001911Z.fail.html.gz
https://rubyci.org/logs/rubyci.s3.amazonaws.com/unstable11s/ruby-trunk/log/20181222T052505Z.fail.html.gz

Could you take a look at it again?

Updated by normalperson (Eric Wong) about 7 years ago Actions
Copy link
#10

Status changed from Assigned to Closed

Applied in changeset trunk|r66494.

test/ruby/test_thread.rb (test_fork_while_parent_locked): reduce threads

Reduce thread counts unconditionally for some CI systems with
low limits.. And Solaris apparently lacks RLIMIT_NPROC, so we
can't detect resource limits and scale the test appropriately.

[ruby-core:90670] [Bug #15430]

Updated by normalperson (Eric Wong) about 7 years ago Actions
Copy link
#11 [ruby-core:90671]

takashikkbn@gmail.com wrote:

As of r66490, the test on some CIs is still failing for different reasons.

Expected "#<Process::Status: pid 515 SIGKILL (signal 9)>" to be empty.

OK, I lowered thread counts in r66494.

Btw, can we get dmesg or kernel logs from those CI machines to
confirm it was OOM killer? I guess RLIMIT_NPROC was a bad check
anyways; and OOM killer is a little unpredictable...

ArgumentError: invalid resource name: NPROC

Interesting; I haven't touched Solaris in ages but I never
thought it lacked RLIMIT_NPROC. So I unconditionally removed
the check.

Updated by k0kubun (Takashi Kokubun) about 7 years ago Actions
Copy link
#12 [ruby-core:90676]

Status changed from Closed to Assigned

As of r66497 or r66498, some Linux environments are getting:

  1) Failure:
TestThread#test_fork_while_parent_locked [/home/hsbt/chkbuild/tmp/build/20181222T213003Z/ruby/test/ruby/test_thread.rb:1278]:
lower `nr' if SIGKILL because of RLIMIT_NPROC limit.
Expected "#<Process::Status: pid 17656 SIGKILL (signal 9)>" to be empty.

https://rubyci.org/logs/rubyci.s3.amazonaws.com/debian8/ruby-trunk/log/20181222T213003Z.fail.html.gz
https://rubyci.org/logs/rubyci.s3.amazonaws.com/centos7/ruby-trunk/log/20181222T230003Z.fail.html.gz
https://rubyci.org/logs/rubyci.s3.amazonaws.com/arch/ruby-trunk/log/20181222T210001Z.fail.html.gz

So reopening this. If it's just a bad check, please consider eliminating the check or skipping the test itself to avoid false-positive alarms.

Btw, can we get dmesg or kernel logs from those CI machines to confirm it was OOM killer?

I don't have the access to most of RubyCI servers. Please contact @hsbt (Hiroshi SHIBATA), @naruse (Yui NARUSE), or possibly @akr (Akira Tanaka) about those servers.

Updated by normalperson (Eric Wong) about 7 years ago Actions
Copy link
#13

Status changed from Assigned to Closed

Applied in changeset trunk|r66508.

test/ruby/test_thread.rb (test_fork_while_parent_locked): rewrite to avoid OOM

Instead of using a torture test, trigger the condition for the old
segfault in [Bug #15383] exactly.

[ruby-core:90676] [Bug #15430]

Updated by normalperson (Eric Wong) about 7 years ago Actions
Copy link
#14 [ruby-core:90680]

So reopening this. If it's just a bad check, please consider
eliminating the check or skipping the test itself to avoid
false-positive alarms.

OK, I rewrote the test to surgically trigger the condition
r66508. Feel free to delete the test if it still fails; I don't
think I'll be around much more.

Updated by k0kubun (Takashi Kokubun) about 7 years ago Actions
Copy link
#15 [ruby-core:90682]

It seems to be all green now. Thank you for your help.

Actions

Copy link

Also available in: PDF Atom

Project

General

Profile

Ruby

Custom queries

Bug #15430

test_fork_while_parent_locked is failing status on Ruby CI

Updated by normalperson (Eric Wong) about 7 years ago Actions
Copy link
#1 [ruby-core:90598]

Updated by normalperson (Eric Wong) about 7 years ago Actions
Copy link
#2

Updated by normalperson (Eric Wong) about 7 years ago Actions
Copy link
#3 [ruby-core:90601]

Updated by hsbt (Hiroshi SHIBATA) about 7 years ago Actions
Copy link
#4 [ruby-core:90614]

Updated by normalperson (Eric Wong) about 7 years ago Actions
Copy link
#5 [ruby-core:90630]

Updated by normalperson (Eric Wong) about 7 years ago Actions
Copy link
#6 [ruby-core:90634]

Updated by normalperson (Eric Wong) about 7 years ago Actions
Copy link
#7

Updated by normalperson (Eric Wong) about 7 years ago Actions
Copy link
#8 [ruby-core:90669]

Updated by k0kubun (Takashi Kokubun) about 7 years ago Actions
Copy link
#9 [ruby-core:90670]

Updated by normalperson (Eric Wong) about 7 years ago Actions
Copy link
#10

Updated by normalperson (Eric Wong) about 7 years ago Actions
Copy link
#11 [ruby-core:90671]

Updated by k0kubun (Takashi Kokubun) about 7 years ago Actions
Copy link
#12 [ruby-core:90676]

Updated by normalperson (Eric Wong) about 7 years ago Actions
Copy link
#13

Updated by normalperson (Eric Wong) about 7 years ago Actions
Copy link
#14 [ruby-core:90680]

Updated by k0kubun (Takashi Kokubun) about 7 years ago Actions
Copy link
#15 [ruby-core:90682]

Project

General

Profile

Ruby

Custom queries

Bug #15430

test_fork_while_parent_locked is failing status on Ruby CI

Updated by normalperson (Eric Wong) about 7 years ago ActionsCopy link #1 [ruby-core:90598]

Updated by normalperson (Eric Wong) about 7 years ago ActionsCopy link #2

Updated by normalperson (Eric Wong) about 7 years ago ActionsCopy link #3 [ruby-core:90601]

Updated by hsbt (Hiroshi SHIBATA) about 7 years ago ActionsCopy link #4 [ruby-core:90614]

Updated by normalperson (Eric Wong) about 7 years ago ActionsCopy link #5 [ruby-core:90630]

Updated by normalperson (Eric Wong) about 7 years ago ActionsCopy link #6 [ruby-core:90634]

Updated by normalperson (Eric Wong) about 7 years ago ActionsCopy link #7

Updated by normalperson (Eric Wong) about 7 years ago ActionsCopy link #8 [ruby-core:90669]

Updated by k0kubun (Takashi Kokubun) about 7 years ago ActionsCopy link #9 [ruby-core:90670]

Updated by normalperson (Eric Wong) about 7 years ago ActionsCopy link #10

Updated by normalperson (Eric Wong) about 7 years ago ActionsCopy link #11 [ruby-core:90671]

Updated by k0kubun (Takashi Kokubun) about 7 years ago ActionsCopy link #12 [ruby-core:90676]

Updated by normalperson (Eric Wong) about 7 years ago ActionsCopy link #13

Updated by normalperson (Eric Wong) about 7 years ago ActionsCopy link #14 [ruby-core:90680]

Updated by k0kubun (Takashi Kokubun) about 7 years ago ActionsCopy link #15 [ruby-core:90682]

Updated by normalperson (Eric Wong) about 7 years ago Actions
Copy link
#1 [ruby-core:90598]

Updated by normalperson (Eric Wong) about 7 years ago Actions
Copy link
#2

Updated by normalperson (Eric Wong) about 7 years ago Actions
Copy link
#3 [ruby-core:90601]

Updated by hsbt (Hiroshi SHIBATA) about 7 years ago Actions
Copy link
#4 [ruby-core:90614]

Updated by normalperson (Eric Wong) about 7 years ago Actions
Copy link
#5 [ruby-core:90630]

Updated by normalperson (Eric Wong) about 7 years ago Actions
Copy link
#6 [ruby-core:90634]

Updated by normalperson (Eric Wong) about 7 years ago Actions
Copy link
#7

Updated by normalperson (Eric Wong) about 7 years ago Actions
Copy link
#8 [ruby-core:90669]

Updated by k0kubun (Takashi Kokubun) about 7 years ago Actions
Copy link
#9 [ruby-core:90670]

Updated by normalperson (Eric Wong) about 7 years ago Actions
Copy link
#10

Updated by normalperson (Eric Wong) about 7 years ago Actions
Copy link
#11 [ruby-core:90671]

Updated by k0kubun (Takashi Kokubun) about 7 years ago Actions
Copy link
#12 [ruby-core:90676]

Updated by normalperson (Eric Wong) about 7 years ago Actions
Copy link
#13

Updated by normalperson (Eric Wong) about 7 years ago Actions
Copy link
#14 [ruby-core:90680]

Updated by k0kubun (Takashi Kokubun) about 7 years ago Actions
Copy link
#15 [ruby-core:90682]