Bug #15430
closedtest_fork_while_parent_locked is failing status on Ruby CI
Description
After r66230, TestThread#test_fork_while_parent_locked is failing status on Ruby CI.
See
- CentOS 7 https://rubyci.org/logs/rubyci.s3.amazonaws.com/centos7/ruby-trunk/log/20181217T230002Z.fail.html.gz
- Amazon Lunux 2 https://rubyci.org/logs/rubyci.s3.amazonaws.com/amazon2/ruby-trunk/log/20181218T003003Z.fail.html.gz
- Arch Linux https://rubyci.org/logs/rubyci.s3.amazonaws.com/arch/ruby-trunk/log/20181218T000002Z.fail.html.gz
        
           Updated by normalperson (Eric Wong) almost 7 years ago
          Updated by normalperson (Eric Wong) almost 7 years ago
          
          
        
        
      
      Thanks. Maybe related to [Bug #15430]. Will investigate in coming days.
        
           Updated by normalperson (Eric Wong) almost 7 years ago
          Updated by normalperson (Eric Wong) almost 7 years ago
          
          
        
        
      
      - Status changed from Assigned to Closed
Applied in changeset trunk|r66438.
thread_sync.c (mutex_ptr): only reinitalize waitqueue at fork
Mutexes need to remain locked after forking.
This fixes "[BUG] invalid keeping_mutexes: Attempt to unlock a
mutex which is locked by another thread" and should
fix test_fork_while_parent_locked failures in CI
[ruby-core:90581] [Bug #15424]
[ruby-core:90595] [Bug #15430]
Fixes: r66230 ("handle mutexes held by parent threads in children")
        
           Updated by normalperson (Eric Wong) almost 7 years ago
          Updated by normalperson (Eric Wong) almost 7 years ago
          
          
        
        
      
      r66438 should fix it.  Ugh, I clearly shouldn't be allowed near
a keyboard anymore :<
        
           Updated by hsbt (Hiroshi SHIBATA) almost 7 years ago
          Updated by hsbt (Hiroshi SHIBATA) almost 7 years ago
          
          
        
        
      
      - Status changed from Closed to Assigned
        
           Updated by normalperson (Eric Wong) almost 7 years ago
          Updated by normalperson (Eric Wong) almost 7 years ago
          
          
        
        
      
      It still exists. https://rubyci.org/logs/rubyci.s3.amazonaws.com/centos7/ruby-trunk/log/20181218T230003Z.fail.html.gz
:<  I am now thinking the r66230 for [Bug #15383] is
unnecessary and [Bug #15383] is a bug somewhere else.
But I cant reproduce it atm...
        
           Updated by normalperson (Eric Wong) almost 7 years ago
          Updated by normalperson (Eric Wong) almost 7 years ago
          
          
        
        
      
      But I cant reproduce it atm...
OK, I need a break and get some sleep (but I realize 2.6 release is
is very soon).
I can reproduce [Bug #15383] with "rb_mutex_t.fork_gen" field
removed, but fork_gen shouldn't be necessary in mutex because of
rb_mutex_abandon_* functions and rb_mutex_cleanup_keeping_mutexes
already call list_head_init(waitq).
So there is a bug in one of the rb_mutex_{abandon,cleanup}*
functions which is hidden by the addition of rb_mutex_t.fork_gen.
        
           Updated by normalperson (Eric Wong) almost 7 years ago
          Updated by normalperson (Eric Wong) almost 7 years ago
          
          
        
        
      
      - Status changed from Assigned to Closed
Applied in changeset trunk|r66477.
thread_sync.c (rb_mutex_cleanup_keeping_mutexes): update fork_gen
... when clearing waitq. Otherwise, we risk redundantly clearing
valid waiters in future calls to mutex_ptr.
Note: I am not sure if this fixes [Bug #15430], and even if it
did, fork_gen is a belt-and-suspenders redundancy for [Bug #15383]
which wastes one word for every Mutex object.
        
           Updated by normalperson (Eric Wong) almost 7 years ago
          Updated by normalperson (Eric Wong) almost 7 years ago
          
          
        
        
      
      Fwiw, I think the bug is in rb_mutex_abandon_locking_mutex,
it needs to clear the waitq unconditionally; but not touch
mutex->th or mutex->next_mutex:
https://80x24.org/spew/20181221223422.9219-1-e@80x24.org/raw
And as a result, rb_mutex_cleanup_keeping_mutexes is redundant.
        
           Updated by k0kubun (Takashi Kokubun) almost 7 years ago
          Updated by k0kubun (Takashi Kokubun) almost 7 years ago
          
          
        
        
      
      - Status changed from Closed to Assigned
As of r66490, the test on some CIs is still failing for different reasons.
Expected "#<Process::Status: pid 515 SIGKILL (signal 9)>" to be empty.
https://rubyci.org/logs/rubyci.s3.amazonaws.com/debian8/ruby-trunk/log/20181222T063002Z.fail.html.gz
https://rubyci.org/logs/rubyci.s3.amazonaws.com/gentoo/ruby-trunk/log/20181222T063003Z.fail.html.gz
https://rubyci.org/logs/rubyci.s3.amazonaws.com/centos7/ruby-trunk/log/20181222T050003Z.fail.html.gz
https://rubyci.org/logs/rubyci.s3.amazonaws.com/fedora27/ruby-trunk/log/20181222T063002Z.fail.html.gz
https://rubyci.org/logs/rubyci.s3.amazonaws.com/arch/ruby-trunk/log/20181222T060002Z.fail.html.gz
ArgumentError: invalid resource name: NPROC
https://rubyci.org/logs/rubyci.s3.amazonaws.com/unstable10x/ruby-trunk/log/20181222T051806Z.fail.html.gz
https://rubyci.org/logs/rubyci.s3.amazonaws.com/unstable11x/ruby-trunk/log/20181222T052407Z.fail.html.gz
https://rubyci.org/logs/rubyci.s3.amazonaws.com/unstable10s/ruby-trunk/log/20181222T001911Z.fail.html.gz
https://rubyci.org/logs/rubyci.s3.amazonaws.com/unstable11s/ruby-trunk/log/20181222T052505Z.fail.html.gz
Could you take a look at it again?
        
           Updated by normalperson (Eric Wong) almost 7 years ago
          Updated by normalperson (Eric Wong) almost 7 years ago
          
          
        
        
      
      - Status changed from Assigned to Closed
Applied in changeset trunk|r66494.
test/ruby/test_thread.rb (test_fork_while_parent_locked): reduce threads
Reduce thread counts unconditionally for some CI systems with
low limits..  And Solaris apparently lacks RLIMIT_NPROC, so we
can't detect resource limits and scale the test appropriately.
[ruby-core:90670] [Bug #15430]
        
           Updated by normalperson (Eric Wong) almost 7 years ago
          Updated by normalperson (Eric Wong) almost 7 years ago
          
          
        
        
      
      takashikkbn@gmail.com wrote:
As of r66490, the test on some CIs is still failing for different reasons.
Expected "#<Process::Status: pid 515 SIGKILL (signal 9)>" to be empty.
OK, I lowered thread counts in r66494.
Btw, can we get dmesg or kernel logs from those CI machines to
confirm it was OOM killer?  I guess RLIMIT_NPROC was a bad check
anyways; and OOM killer is a little unpredictable...
ArgumentError: invalid resource name: NPROC
Interesting; I haven't touched Solaris in ages but I never
thought it lacked RLIMIT_NPROC.  So I unconditionally removed
the check.
        
           Updated by k0kubun (Takashi Kokubun) almost 7 years ago
          Updated by k0kubun (Takashi Kokubun) almost 7 years ago
          
          
        
        
      
      - Status changed from Closed to Assigned
As of r66497 or r66498, some Linux environments are getting:
  1) Failure:
TestThread#test_fork_while_parent_locked [/home/hsbt/chkbuild/tmp/build/20181222T213003Z/ruby/test/ruby/test_thread.rb:1278]:
lower `nr' if SIGKILL because of RLIMIT_NPROC limit.
Expected "#<Process::Status: pid 17656 SIGKILL (signal 9)>" to be empty.
https://rubyci.org/logs/rubyci.s3.amazonaws.com/debian8/ruby-trunk/log/20181222T213003Z.fail.html.gz
https://rubyci.org/logs/rubyci.s3.amazonaws.com/centos7/ruby-trunk/log/20181222T230003Z.fail.html.gz
https://rubyci.org/logs/rubyci.s3.amazonaws.com/arch/ruby-trunk/log/20181222T210001Z.fail.html.gz
So reopening this. If it's just a bad check, please consider eliminating the check or skipping the test itself to avoid false-positive alarms.
Btw, can we get dmesg or kernel logs from those CI machines to confirm it was OOM killer?
I don't have the access to most of RubyCI servers. Please contact @hsbt (Hiroshi SHIBATA), @naruse (Yui NARUSE), or possibly @akr (Akira Tanaka) about those servers.
        
           Updated by normalperson (Eric Wong) almost 7 years ago
          Updated by normalperson (Eric Wong) almost 7 years ago
          
          
        
        
      
      - Status changed from Assigned to Closed
Applied in changeset trunk|r66508.
test/ruby/test_thread.rb (test_fork_while_parent_locked): rewrite to avoid OOM
Instead of using a torture test, trigger the condition for the old
segfault in [Bug #15383] exactly.
[ruby-core:90676] [Bug #15430]
        
           Updated by normalperson (Eric Wong) almost 7 years ago
          Updated by normalperson (Eric Wong) almost 7 years ago
          
          
        
        
      
      So reopening this. If it's just a bad check, please consider
eliminating the check or skipping the test itself to avoid
false-positive alarms.
OK, I rewrote the test to surgically trigger the condition
r66508.  Feel free to delete the test if it still fails; I don't
think I'll be around much more.
        
           Updated by k0kubun (Takashi Kokubun) almost 7 years ago
          Updated by k0kubun (Takashi Kokubun) almost 7 years ago
          
          
        
        
      
      It seems to be all green now. Thank you for your help.