Project

General

Profile

Actions

Bug #11336

closed

TestProcess#test_exec_fd_3_redirect failed on Solaris 10

Added by ngoto (Naohisa Goto) over 9 years ago. Updated over 9 years ago.

Status:
Closed
Assignee:
-
Target version:
-
ruby -v:
ruby 2.3.0dev (2015-07-06) [sparc64-solaris2.10]
[ruby-core:69886]

Description

On Solaris 10, TestProcess#test_exec_fd_3_redirect failed since r51146 (where the test was added).

  1) Failure:
TestProcess#test_exec_fd_3_redirect [/XXXXX-51146/test/ruby/test_process.rb:2049]:
<"."> expected but was
<nil>.

Related issues 2 (0 open2 closed)

Related to Ruby master - Bug #11350: When Process.exec failed, redirections were still changed and not restoredClosedActions
Related to Ruby master - Bug #11353: ASYNC BUG after failure of Process.exec when closing FD 3 (or 4 or 5)ClosedActions

Updated by normalperson (Eric Wong) over 9 years ago

Was there any output in stderr from this test? Thanks.

Updated by ngoto (Naohisa Goto) over 9 years ago

Was there any output in stderr from this test?

nothing

Recently, the failure message is changed to below.
It seems the SIGKILL is caused by timeout of the test.

TestProcess#test_exec_fd_3_redirect [/XXXXX-51186/test/ruby/test_process.rb:2038]:
assert_separately failed
pid 9248 killed by SIGKILL (signal 9)

Updated by normalperson (Eric Wong) over 9 years ago

Can you try a higher fd number instead of 3?
Also, truss output would be useful.
Thanks.

Updated by ngoto (Naohisa Goto) over 9 years ago

Sometimes ASYNC BUG occurs, but it cannot be reproduced with truss.

$ ruby -e 'a = IO.pipe; b = IO.pipe; p a; p b; pid = fork { exec("ruby", "-e", "print IO.for_fd(3).read(1)", 3=>a[0],1=>b[1]) }; b[1].close; a[0].close; a[1].write("."); p b[0].read(1); Process.wait(pid)'
[#<IO:fd 7>, #<IO:fd 8>]
[#<IO:fd 9>, #<IO:fd 10>]
[ASYNC BUG] consume_communication_pipe: read

EBADF

ruby 2.3.0dev (2015-07-08) [sparc64-solaris2.10]

[NOTE]
You may have encountered a bug in the Ruby interpreter or extension libraries.
Bug reports are welcome.
For details: http://www.ruby-lang.org/bugreport.html

"."

Updated by normalperson (Eric Wong) over 9 years ago

wrote:

$ ruby -e 'a = IO.pipe; b = IO.pipe; p a; p b; pid = fork { exec("ruby", "-e", "print IO.for_fd(3).read(1)", 3=>a[0],1=>b[1]) }; b[1].close; a[0].close; a[1].write("."); p b[0].read(1); Process.wait(pid)'
[#<IO:fd 7>, #<IO:fd 8>]
[#<IO:fd 9>, #<IO:fd 10>]
[ASYNC BUG] consume_communication_pipe: read

OK, I also hit the problem on a VM, too.

The problem is the timer thread is still running when we are performing
redirects for exec.

Can you try the following to stop the timer thread?

--- a/process.c
+++ b/process.c
@@ -2566,7 +2566,7 @@ rb_f_exec(int argc, const VALUE *argv)
 #if defined(__APPLE__) || defined(__HAIKU__)
     rb_exec_without_timer_thread(eargp, errmsg, sizeof(errmsg));
 #else
-    before_exec_async_signal_safe(); /* async-signal-safe */
+    before_exec(); /* NOT async-signal-safe */
     rb_exec_async_signal_safe(eargp, errmsg, sizeof(errmsg));
     preserving_errno(after_exec_async_signal_safe()); /* async-signal-safe */
 #endif

I will have limited Internet access the next few days.

Feel free to commit if it works for you; maybe the timer thread needs to
be restarted if execve fails, too (but the process will die).

Also, maybe the APPLE || HAIKU code above is suitable for all
OS, too.

In the coming weeks, we may also consider lazy spawning timer thread,
single-threaded scripts do not need it.

Updated by ngoto (Naohisa Goto) over 9 years ago

Also, maybe the APPLE || HAIKU code above is suitable for all
OS, too.

I think so, and the following patch solves the failure ofTestProcess#test_exec_fd_3_redirect on Solaris 10.

Index: process.c
===================================================================
--- process.c   (revision 51194)
+++ process.c   (working copy)
@@ -2473,7 +2473,7 @@
     RB_GC_GUARD(execarg_obj);
 }
 
-#if defined(__APPLE__) || defined(__HAIKU__)
+#if defined(__APPLE__) || defined(__HAIKU__) || defined(__sun)
 static int rb_exec_without_timer_thread(const struct rb_execarg *eargp, char *errmsg, size_t errmsg_buflen);
 #endif
 
@@ -2563,7 +2563,7 @@
     rb_execarg_parent_start(execarg_obj);
     fail_str = eargp->use_shell ? eargp->invoke.sh.shell_script : eargp->invoke.cmd.command_name;
 
-#if defined(__APPLE__) || defined(__HAIKU__)
+#if defined(__APPLE__) || defined(__HAIKU__) || defined(__sun)
     rb_exec_without_timer_thread(eargp, errmsg, sizeof(errmsg));
 #else
     before_exec_async_signal_safe(); /* async-signal-safe */
@@ -3077,7 +3077,7 @@
     return -1;
 }
 
-#if defined(__APPLE__) || defined(__HAIKU__)
+#if defined(__APPLE__) || defined(__HAIKU__) || defined(__sun)
 static int
 rb_exec_without_timer_thread(const struct rb_execarg *eargp, char *errmsg, size_t errmsg_buflen)
 {

Updated by ngoto (Naohisa Goto) over 9 years ago

The following example can reproduce the problem that the timer thread reads fd=3 after the redirection to fd=3 is set and before execve() is called, and then the script waits forever in b[0].read(1) and/or IO.for_fd(3).read(1).

ruby -e 'a = IO.pipe; b = IO.pipe; p a; p b; opt = { 3=>a[0], 1=>b[1] }; (5..50).each { |i| opt[i] = "/dev/zero" }; pid = fork { exec("ruby", "-e", "print IO.for_fd(3).read(1)", opt)}; b[1].close; a[0].close; a[1].write("."); p b[0].read(1); Process.wait(pid)'

So, I also think stopping timer thread before execve is needed on all OS.

Updated by ngoto (Naohisa Goto) over 9 years ago

I forgot to mention that this was reproduced on x86_64 Linux.

Actions #9

Updated by ngoto (Naohisa Goto) over 9 years ago

  • Status changed from Open to Closed

Applied in changeset r51209.


  • process.c (rb_f_exec): rb_exec_without_timer_thread should be
    used on every OS, not only on Mac OS X or Haiku, to prevent
    timer thread communications using file descriptor 3 after
    setting redirection of the fd 3 before calling exec.
    [Bug #11336] [ruby-core:69886]
Actions #10

Updated by ngoto (Naohisa Goto) over 9 years ago

  • Related to Bug #11350: When Process.exec failed, redirections were still changed and not restored added
Actions #11

Updated by ngoto (Naohisa Goto) over 9 years ago

  • Related to Bug #11353: ASYNC BUG after failure of Process.exec when closing FD 3 (or 4 or 5) added

Updated by ngoto (Naohisa Goto) over 9 years ago

  • Status changed from Closed to Open

In r51265, I reverted r51209 because signals after stopping timer thread before entering exec(2) system call might be lost.

On Mac OS X and Haiku, the loss of signal still might occur.

Actions #13

Updated by ngoto (Naohisa Goto) over 9 years ago

  • Status changed from Open to Closed

Applied in changeset r51268.


  • process.c (redirect_dup2): when the new FD of dup2() coflicts
    with one of the timer thread FDs, the internal FD is diverted.
    [Bug #11336] [ruby-core:69886] [Bug #11350] [ruby-core:69961]

  • process.c (dup2_with_divert): new function for the above purpose.

  • thread_pthread.c (rb_divert_reserved_fd): new function for
    diverting reserved FD. If the given FD is the same as one of the
    reserved FDs, the reserved FD number is internally changed.
    It returns -1 when error. Otherwise, returns 0. It also returns
    0 if there is no need to change reserved FD number.

  • thread_win32.c (rb_divert_reserved_fd): always returns 0 because
    of no reserved FDs.

  • internal.h (rb_divert_reserved_fd): prototype declaration.
    It is Ruby internal use only.

Updated by normalperson (Eric Wong) over 9 years ago

Sorry, but I think rb_divert_reserved_fd seems a racy fix. I think the
correct fix is to allow single-threaded Ruby to process signals without
timer thread; I will try to make it happen now.

Updated by ngoto (Naohisa Goto) over 9 years ago

Sorry, but I think rb_divert_reserved_fd seems a racy fix.

Yes, it might be. Comparison and replacement of the fd is not atomic there, and calling dup2 after the rb_divert_reserved_fd is also a source of race.

Indeed, I implicitly expected that rb_divert_reserved_fd and dup2_with_divert were executed with GVL, but this might not be true in some cases.

Updated by normalperson (Eric Wong) over 9 years ago

I started working on this patch which causes pipes to always
be closed during timer thread shutdown.

http://80x24.org/spew/m/839ba256acbf1a5b17ad60953fc3ecf1df6cec50.txt

However, I may also eliminate timer thread entirely (on POSIX) by
using timer_create+timer_settime to deliver SIGVTALRM.

I also eliminated malloc from signal thread list (but not FGLOCK, yet),
but eliminating malloc may not be necessary.
http://80x24.org/spew/m/1437190754-29067-1-git-send-email-e@80x24.org.txt

Updated by ngoto (Naohisa Goto) over 9 years ago

With the patch, when exec system call fails, signals that are sent after stopping the timer thread before re-starting the timer thread may be ignored.

Currently, except Mac OS X and Haiku, the signals could be invoked after the failure of exec with no loss.

Updated by normalperson (Eric Wong) over 9 years ago

wrote:

With the patch, when exec system call fails, signals that are sent
after stopping the timer thread before re-starting the timer thread
may be ignored.

I'm not seeing it, did you notice/reproduce this failure?

I see:

  1. rb_thread_stop_timer_thread()

  2. signal received:
    sighandler
    -> signal_enque -> ATOMIC_INC(signal_buff.size) (++)
    -> rb_thread_wakeup_timer_thread -> noop since TT not running

  3. exec fails

  4. rb_thread_start_timer_thread()
    -> pthread_create(..., thread_timer)
    -> thread_timer
    -> timer_thread_function
    -> rb_threadptr_check_signal
    -> if (rb_signal_buff_size() > 0) { /* true from (++) /
    /
    continue processing signals */

In other words, I'm not seeing a problem based on the (++) steps.

Updated by normalperson (Eric Wong) over 9 years ago

Eric Wong wrote:

I started working on this patch which causes pipes to always
be closed during timer thread shutdown.

http://80x24.org/spew/m/839ba256acbf1a5b17ad60953fc3ecf1df6cec50.txt

OK, need to check FD validity, too. Still tricky:
http://80x24.org/spew/m/1437379741-6844-1-git-send-email-e@80x24.org.txt

However, I may also eliminate timer thread entirely (on POSIX) by
using timer_create+timer_settime to deliver SIGVTALRM.

Maybe...

Updated by ngoto (Naohisa Goto) over 9 years ago

I'm sorry too late.

On Solaris 10, the patch works fine, with no error/failure during make test-all (with r51450).

I'm not seeing it, did you notice/reproduce this failure?

No.
It seems the patch works correctly as you described.

I'm wondering whether the TODO in the following comment is still valid or not.
It seems it was first added in r32244 and was modified in r32276.

      /* close communication pipe */
      if (close_anyway) {
          /* TODO: Uninstall all signal handlers or mask all signals.
           *       This pass is cleaning phase (terminate ruby process).
           *       To avoid such race, we skip to close communication
           *       pipe.  OS will close it at process termination.
           *       It may not good practice, but pragmatic.
           *       We remain it is TODO.
           */
          /* close_communication_pipe(); */
      }

Updated by normalperson (Eric Wong) over 9 years ago

wrote:

I'm sorry too late.

No worries, I don't have much time for ruby these few weeks, either.

On Solaris 10, the patch works fine, with no error/failure during make test-all (with r51450).

I'm not seeing it, did you notice/reproduce this failure?

No.
It seems the patch works correctly as you described.

I'm wondering whether the TODO in the following comment is still valid or not.
It seems it was first added in r32244 and was modified in r32276.

Not valid, I removed it in this patch [ruby-core:70050]:

http://80x24.org/spew/m/1437379741-6844-1-git-send-email-e@80x24.org.txt

Actions #22

Updated by ngoto (Naohisa Goto) over 9 years ago

  • Status changed from Closed to Open

I think the patch http://80x24.org/spew/m/1437379741-6844-1-git-send-email-e@80x24.org.txt is better not only for Solaris but also for other OSs.

Actions #23

Updated by ngoto (Naohisa Goto) over 9 years ago

  • Status changed from Open to Closed

memo:
r51576 fixes the bug (+ r51578 suppresses warning message)
r51268 is reverted by r51576

Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0