Bug #17941: POSIX timers are not fork-safe on 2.7.x - Ruby - Ruby Issue Tracking System

Actions

Copy link

Bug #17941

closed

POSIX timers are not fork-safe on 2.7.x

Added by jbrownEP (James Brown) about 4 years ago. Updated about 4 years ago.

Status:

Closed

Assignee:

Target version:

ruby -v:

Backport:

2.6: REQUIRED, 2.7: REQUIRED, 3.0: DONE

[ruby-core:104199]

Description

I have an application which runs under a pre-fork-model web server (uwsgi).

When a worker exits, SIGCHLD will be delivered to the parent process, which is handled by sighandler and rb_thread_wakeup_timer_thread. This calls ubf_timer_arm in the parent process, which creates a POSIX timer (by calling timer_create). After that, uWSGI will fork off a new child, but by this point, the timer_posix global will have been populated by a timerid owned by the parent process.

At some point, that new child will call ubf_timer_disarm, which will then call timer_settime with the timer_id that was generated in the parent thread.

On Linux, it is fatal to share timers between parent/child processes, so the timer_settime invocation will fail with EINVAL and ruby will abort with a stack trace rather like the following:

#0  0x00007fe0c60784f5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007fe0c6079cd5 in abort () at abort.c:92
#2  0x00007fe0c6644363 in die (fmt=0x7fe0c6817285 "%s: %s (%s)") at error.c:632
#3  rb_bug (fmt=0x7fe0c6817285 "%s: %s (%s)") at error.c:647
#4  0x00007fe0c66b39f9 in rb_bug_errno (mesg=0x7fe0c6842dde "timer_settime (disarm)", errno_arg=22) at error.c:676
#5  0x00007fe0c67b6ee0 in ubf_timer_disarm () at thread_pthread.c:1721
#6  0x00007fe0c67b94da in rb_sigwait_fd_get (th=<value optimized out>) at thread_pthread.c:1919
#7  0x00007fe0c67bd74d in rb_threadptr_execute_interrupts (th=<value optimized out>, blocking_timing=0) at thread.c:2215
#8  0x00007fe0c67bda2e in rb_vm_check_ints (th=0x1c29480, region=0x7ffe4b15da40, ubf=0x7fe0c67ba780 <ubf_select>, arg=0x1c29480, fail_if_interrupted=0) at vm_core.h:1868
#9  unblock_function_set (th=0x1c29480, region=0x7ffe4b15da40, ubf=0x7fe0c67ba780 <ubf_select>, arg=0x1c29480, fail_if_interrupted=0) at thread.c:461
#10 blocking_region_begin (th=0x1c29480, region=0x7ffe4b15da40, ubf=0x7fe0c67ba780 <ubf_select>, arg=0x1c29480, fail_if_interrupted=0) at thread.c:1416
#11 0x00007fe0c67c1652 in rb_nogvl (func=0x7fe0c66bc140 <no_gvl_stat>, data1=0x7ffe4b15db20, ubf=<value optimized out>, data2=<value optimized out>, flags=<value optimized out>) at thread.c:1470
#12 0x00007fe0c66c08fa in stat_without_gvl (basedir=8, path=<value optimized out>, origenc=0x0, mode=RB_REALPATH_CHECK) at file.c:1120
#13 rb_check_realpath_internal (basedir=8, path=<value optimized out>, origenc=0x0, mode=RB_REALPATH_CHECK) at file.c:4401
#14 0x00007fe0c66f913f in rb_construct_expanded_load_path (type=EXPAND_ALL, has_relative=0x7ffe4b15dc0c, has_non_cache=0x7ffe4b15dc08) at load.c:82
#15 0x00007fe0c66f9351 in rb_get_expanded_load_path () at load.c:100
#16 0x00007fe0c66f9ada in rb_feature_p (feature=0x26d0e30 "/home/vagrant/pg/easy_post/uwsgi_tasks.rb", ext=0x26d0e56 ".rb", rb=1, expanded=0, fn=0x7ffe4b15dd00) at load.c:466
#17 0x00007fe0c66f7a86 in search_required (fname=36037560, path=0x7ffe4b15de80, rb_feature_p=0x7fe0c66f93f0 <rb_feature_p>) at load.c:847
#18 0x00007fe0c66f85d4 in require_internal (ec=0x1c4ea00, fname=36037560, exception=1) at load.c:1002
#19 0x00007fe0c66f8d2b in rb_require_string (fname=36038000) at load.c:1104
#20 0x00007fe0c67ee565 in vm_call0_cfunc_with_frame (ec=0x1c4ea00, calling=<value optimized out>, cd=<value optimized out>, argv=0x7ffe4b15e140) at vm_eval.c:91
#21 vm_call0_cfunc (ec=0x1c4ea00, calling=<value optimized out>, cd=<value optimized out>, argv=0x7ffe4b15e140) at vm_eval.c:105
#22 vm_call0_body (ec=0x1c4ea00, calling=<value optimized out>, cd=<value optimized out>, argv=0x7ffe4b15e140) at vm_eval.c:140
#23 0x00007fe0c67ee9f3 in rb_vm_call0 (ec=<value optimized out>, recv=<value optimized out>, id=<value optimized out>, argc=<value optimized out>, argv=<value optimized out>, me=0x225e430, kw_splat=0) at vm_eval.c:52
#24 0x00007fe0c67eeda6 in rb_vm_call_kw (ec=0x1c4ea00, recv=30364240, id=21185, argc=1, argv=0x7ffe4b15e140, me=<value optimized out>, kw_splat=0) at vm_eval.c:268
#25 0x00007fe0c67f133d in rb_call0 (recv=30364240, mid=21185, argc=1, argv=0x7ffe4b15e140) at vm_eval.c:392
#26 rb_call (recv=30364240, mid=21185, argc=1, argv=0x7ffe4b15e140) at vm_eval.c:718
#27 rb_funcallv (recv=30364240, mid=21185, argc=1, argv=0x7ffe4b15e140) at vm_eval.c:958
#28 0x00000000004968a4 in uwsgi_require_file ()
#29 0x00007fe0c66b9355 in rb_protect (proc=0x496860 <uwsgi_require_file>, data=36038000, pstate=0x7ffe4b15e2bc) at eval.c:1087
#30 0x00000000004984f0 in uwsgi_rack_init_apps ()
#31 0x000000000046e8d2 in uwsgi_init_all_apps ()
#32 0x00000000004747d8 in uwsgi_worker_run ()
#33 0x0000000000474cee in uwsgi_run ()
#34 0x00007fe0c6064d20 in __libc_start_main (main=0x474d00 <main>, argc=3, ubp_av=0x7ffe4b15e468, init=<value optimized out>, fini=<value optimized out>, rtld_fini=<value optimized out>, stack_end=0x7ffe4b15e458) at libc-start.c:226
#35 0x00000000004213b9 in _start ()

There are a few things that seem likely as fixes for this:

In ubf_timer_disarm, compare timer_posix.owner with getpid(), as is done in ubf_timer_arm. If the timer is owned by a different process, it's probably a no-op to disarm it and should be reset back to NULL
Install an atfork function ubf_timer_atfork that resets timer_posix to NULL

Actions

Copy link

#1 [ruby-core:104200]

Updated by jbrownEP (James Brown) about 4 years ago

Here is a small C program which aborts on Ruby 2.7+ on Linux but works on Ruby 2.5:

#include <ruby.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <signal.h>


int main(int argc, char **argv) {
    ruby_init();
    ruby_init_loadpath();

    // send a no-op signal in order to trigger ruby signal handling in the parent
    kill(getpid(), SIGUSR1);

    int pid2 = fork();
    if (pid2 == 0) {
        int state = 0;
        // ignore SIGUSR1 if it gets dequeued here
        rb_eval_string("Signal.trap('USR1') { }");
        rb_eval_string_protect("puts 'Hello, world!'", &state);
        if (state) {
            fprintf(stderr, "error %d calling rb_eval_string_protect\n", state);
            return ruby_cleanup(0);
        } else {
            return ruby_cleanup(0);
        }
    } else {
        int status;
        waitpid(pid2, &status, 0);
        printf("child is done with exit status %d\n", status);
        return 0;
    }
}

Incidentally, if I patch thread_pthread.c to force #define UBF_TIMER UBF_TIMER_PTHREAD, I can confirm that this no longer aborts on 2.7.

Actions

Copy link