Project

General

Profile

Actions

Bug #17993

closed

Ruby process stuck in rb_sigwait_sleep

Added by cout (Paul Brannan) almost 3 years ago. Updated almost 3 years ago.

Status:
Third Party's Issue
Assignee:
-
Target version:
-
ruby -v:
ruby 2.7.3p183 (2021-04-05 revision 6847ee089d) [x86_64-linux-gnu]
[ruby-core:104322]

Description

I have a very simple client for kudu (https://kudu.apache.org/) implement as a wrapper that I invoke with FFI. After creating a kudu client with KuduClientBuilder::Build(), subsequent calls to system cause the ruby interpreter to block indefinitely. It is unclear to me whether the fault is in the ruby interpreter, in my wrapper code, or in kudu.

The wrapper looks like this:

#include <kudu/client/client.h>

#include <string>

extern "C" {

int build_client(char const * addr) {
  ::kudu::client::sp::shared_ptr<::kudu::client::KuduClient> client;

  auto status = ::kudu::client::KuduClientBuilder()
      .add_master_server_addr(addr)
      .Build(&client);

  return status.ok();
}

}

And the ruby script looks like this:

require 'ffi'

module TestFFI
  extend ::FFI::Library
  ffi_lib './libtest_ffi.so'
  attach_function :build_client, [ :strptr ], :int
end

TestFFI.build_client(ARGV[0])

system("true");

The backtrace for the main thread looks like this:

#0  0x00007ff93fbe3811 in ppoll () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ff9411030e0 in rb_sigwait_sleep (th=th@entry=0x1261440, sigwait_fd=sigwait_fd@entry=3, rel=rel@entry=0x0) at /data/ruby-2.7.3/hrtime.h:148
#2  0x00007ff941104488 in native_sleep (th=0x1261440, rel=0x0) at /data/ruby-2.7.3/thread_pthread.c:2099
#3  0x00007ff94110708e in rb_thread_sleep_interruptible () at /data/ruby-2.7.3/thread.c:1329
#4  0x00007ff94106d945 in waitpid_sleep (x=x@entry=140729366950784) at /data/ruby-2.7.3/process.c:1151
#5  0x00007ff940f8f32c in rb_ensure (b_proc=b_proc@entry=0x7ff94106d930 <waitpid_sleep>, data1=data1@entry=140729366950784, e_proc=e_proc@entry=0x7ff94106d8e0 <waitpid_cleanup>, 
    data2=data2@entry=140729366950784) at /data/ruby-2.7.3/eval.c:1129
#6  0x00007ff941075389 in rb_f_system (argc=<optimized out>, argv=<optimized out>, _=<optimized out>) at /data/ruby-2.7.3/process.c:4557
#7  0x00007ff941138f53 in vm_call_cfunc_with_frame (empty_kw_splat=<optimized out>, cd=<optimized out>, calling=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>)
    at /data/ruby-2.7.3/vm_insnhelper.c:2514
#8  vm_call_cfunc (ec=0x1261a50, reg_cfp=0x7ff94168ffa0, calling=<optimized out>, cd=0x156d680) at /data/ruby-2.7.3/vm_insnhelper.c:2539
#9  0x00007ff9411459ac in vm_sendish (method_explorer=<optimized out>, block_handler=<optimized out>, cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>)
    at /data/ruby-2.7.3/vm_insnhelper.c:4023
#10 vm_exec_core (ec=0x7ffe1bed31b8, initial=1) at insns.def:801
#11 0x00007ff94114bc1f in rb_vm_exec (ec=0x1261a50, mjit_enable_p=1) at /data/ruby-2.7.3/vm.c:1929
#12 0x00007ff940f8bbd0 in rb_ec_exec_node (ec=ec@entry=0x1261a50, n=n@entry=0x13ce5c8) at /data/ruby-2.7.3/eval.c:278
#13 0x00007ff940f91078 in ruby_run_node (n=0x13ce5c8) at /data/ruby-2.7.3/eval.c:336
#14 0x00000000004009df in main (argc=<optimized out>, argv=<optimized out>) at /data/ruby-2.7.3/main.c:50

Tested with ruby 2.7.3p183 and kudu 1.10.0-cdh6.3.0 on Ubuntu 16.04.

Everything works as expected on ruby 2.5.

I have not tried any other combinations of ruby or kudu.

Updated by jeremyevans0 (Jeremy Evans) almost 3 years ago

Is the build_client function installing a SIGCLD handler? The code path in the backtrace is what is taken when #system uses SIGCLD/SIGCHLD for waiting, and maybe a separate SIGCLD handler interfers with that.

Since this worked on Ruby 2.5, it may be related to #16835.

Updated by xtkoba (Tee KOBAYASHI) almost 3 years ago

I doubt if installation of SIGCHLD handler is relevant. It is true that the handler for SIGCHLD is changed in Webserver::Start() (src/kudu/server/webserver.cc):

sighandler_t sig_chld = signal(SIGCHLD, SIG_DFL);
/* ... */
signal(SIGCHLD, sig_chld);

but as the previous handler is restored it should not affect the subsequent #system. Furthermore, this is in the server side, not client.

if #note-1 were the point, a workaround would be as follows:

# ...

TestFFI.build_client(ARGV[0])
Signal.trap(:CHLD, "DEFAULT") # added
system "true"

Updated by cout (Paul Brannan) almost 3 years ago

As far as I can tell looking at the kudu source, none of the threads created by the kudu library installs a SIGCHLD handler. But this certainly feels related, because system behaves as expected if I ignore SIGCHLD prior to calling it. And if I do not ignore SIGCHLD, then signhandler never gets called.

I do see this in the output of strace:

rt_sigaction(SIGCHLD, {0x7f67b29b946f, ~[RTMIN RT_1], SA_RESTORER|SA_RESTART, 0x7f67b81be390}, NULL, 8) = 0

which indicates it is installing its own signal handler, but I don't see anything in the source code of kudu that would do this (Webserver::Start() is iirc only called from the server processes, not from the client library).

Actions #4

Updated by cout (Paul Brannan) almost 3 years ago

I think I have tracked the problem to libev. Kudu uses libev for its reactor, and libev installs a signal handler for SIGCHLD (http://pod.tst.eu/http://cvs.schmorp.de/libev/ev.pod):

Libev grabs SIGCHLD as soon as the default event loop is initialised. This is necessary to guarantee proper behaviour even if the first child watcher is started after the child exits. The occurrence of SIGCHLD is recorded asynchronously, but child reaping is done synchronously as part of the event loop processing. Libev always reaps all children, even ones not watched.

Unfortunately AFAICT there is no good way to turn this off through the C++ interface to libev, even if it is not needed. That means any library that uses libev cannot be used with ruby out of the box. My gut feeling is that this is libev's fault for installing a SIGCHLD handler, not invoking the originally installed SIGCHLD handler, and reaping child processes that it did not start.

Actions #5

Updated by jeremyevans0 (Jeremy Evans) almost 3 years ago

  • Status changed from Open to Third Party's Issue
Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0Like0