Project

General

Profile

Actions

Bug #17941

closed

POSIX timers are not fork-safe on 2.7.x

Added by jbrownEP (James Brown) over 3 years ago. Updated over 3 years ago.

Status:
Closed
Assignee:
-
Target version:
-
[ruby-core:104199]

Description

I have an application which runs under a pre-fork-model web server (uwsgi).

When a worker exits, SIGCHLD will be delivered to the parent process, which is handled by sighandler and rb_thread_wakeup_timer_thread. This calls ubf_timer_arm in the parent process, which creates a POSIX timer (by calling timer_create). After that, uWSGI will fork off a new child, but by this point, the timer_posix global will have been populated by a timerid owned by the parent process.

At some point, that new child will call ubf_timer_disarm, which will then call timer_settime with the timer_id that was generated in the parent thread.

On Linux, it is fatal to share timers between parent/child processes, so the timer_settime invocation will fail with EINVAL and ruby will abort with a stack trace rather like the following:

#0  0x00007fe0c60784f5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007fe0c6079cd5 in abort () at abort.c:92
#2  0x00007fe0c6644363 in die (fmt=0x7fe0c6817285 "%s: %s (%s)") at error.c:632
#3  rb_bug (fmt=0x7fe0c6817285 "%s: %s (%s)") at error.c:647
#4  0x00007fe0c66b39f9 in rb_bug_errno (mesg=0x7fe0c6842dde "timer_settime (disarm)", errno_arg=22) at error.c:676
#5  0x00007fe0c67b6ee0 in ubf_timer_disarm () at thread_pthread.c:1721
#6  0x00007fe0c67b94da in rb_sigwait_fd_get (th=<value optimized out>) at thread_pthread.c:1919
#7  0x00007fe0c67bd74d in rb_threadptr_execute_interrupts (th=<value optimized out>, blocking_timing=0) at thread.c:2215
#8  0x00007fe0c67bda2e in rb_vm_check_ints (th=0x1c29480, region=0x7ffe4b15da40, ubf=0x7fe0c67ba780 <ubf_select>, arg=0x1c29480, fail_if_interrupted=0) at vm_core.h:1868
#9  unblock_function_set (th=0x1c29480, region=0x7ffe4b15da40, ubf=0x7fe0c67ba780 <ubf_select>, arg=0x1c29480, fail_if_interrupted=0) at thread.c:461
#10 blocking_region_begin (th=0x1c29480, region=0x7ffe4b15da40, ubf=0x7fe0c67ba780 <ubf_select>, arg=0x1c29480, fail_if_interrupted=0) at thread.c:1416
#11 0x00007fe0c67c1652 in rb_nogvl (func=0x7fe0c66bc140 <no_gvl_stat>, data1=0x7ffe4b15db20, ubf=<value optimized out>, data2=<value optimized out>, flags=<value optimized out>) at thread.c:1470
#12 0x00007fe0c66c08fa in stat_without_gvl (basedir=8, path=<value optimized out>, origenc=0x0, mode=RB_REALPATH_CHECK) at file.c:1120
#13 rb_check_realpath_internal (basedir=8, path=<value optimized out>, origenc=0x0, mode=RB_REALPATH_CHECK) at file.c:4401
#14 0x00007fe0c66f913f in rb_construct_expanded_load_path (type=EXPAND_ALL, has_relative=0x7ffe4b15dc0c, has_non_cache=0x7ffe4b15dc08) at load.c:82
#15 0x00007fe0c66f9351 in rb_get_expanded_load_path () at load.c:100
#16 0x00007fe0c66f9ada in rb_feature_p (feature=0x26d0e30 "/home/vagrant/pg/easy_post/uwsgi_tasks.rb", ext=0x26d0e56 ".rb", rb=1, expanded=0, fn=0x7ffe4b15dd00) at load.c:466
#17 0x00007fe0c66f7a86 in search_required (fname=36037560, path=0x7ffe4b15de80, rb_feature_p=0x7fe0c66f93f0 <rb_feature_p>) at load.c:847
#18 0x00007fe0c66f85d4 in require_internal (ec=0x1c4ea00, fname=36037560, exception=1) at load.c:1002
#19 0x00007fe0c66f8d2b in rb_require_string (fname=36038000) at load.c:1104
#20 0x00007fe0c67ee565 in vm_call0_cfunc_with_frame (ec=0x1c4ea00, calling=<value optimized out>, cd=<value optimized out>, argv=0x7ffe4b15e140) at vm_eval.c:91
#21 vm_call0_cfunc (ec=0x1c4ea00, calling=<value optimized out>, cd=<value optimized out>, argv=0x7ffe4b15e140) at vm_eval.c:105
#22 vm_call0_body (ec=0x1c4ea00, calling=<value optimized out>, cd=<value optimized out>, argv=0x7ffe4b15e140) at vm_eval.c:140
#23 0x00007fe0c67ee9f3 in rb_vm_call0 (ec=<value optimized out>, recv=<value optimized out>, id=<value optimized out>, argc=<value optimized out>, argv=<value optimized out>, me=0x225e430, kw_splat=0) at vm_eval.c:52
#24 0x00007fe0c67eeda6 in rb_vm_call_kw (ec=0x1c4ea00, recv=30364240, id=21185, argc=1, argv=0x7ffe4b15e140, me=<value optimized out>, kw_splat=0) at vm_eval.c:268
#25 0x00007fe0c67f133d in rb_call0 (recv=30364240, mid=21185, argc=1, argv=0x7ffe4b15e140) at vm_eval.c:392
#26 rb_call (recv=30364240, mid=21185, argc=1, argv=0x7ffe4b15e140) at vm_eval.c:718
#27 rb_funcallv (recv=30364240, mid=21185, argc=1, argv=0x7ffe4b15e140) at vm_eval.c:958
#28 0x00000000004968a4 in uwsgi_require_file ()
#29 0x00007fe0c66b9355 in rb_protect (proc=0x496860 <uwsgi_require_file>, data=36038000, pstate=0x7ffe4b15e2bc) at eval.c:1087
#30 0x00000000004984f0 in uwsgi_rack_init_apps ()
#31 0x000000000046e8d2 in uwsgi_init_all_apps ()
#32 0x00000000004747d8 in uwsgi_worker_run ()
#33 0x0000000000474cee in uwsgi_run ()
#34 0x00007fe0c6064d20 in __libc_start_main (main=0x474d00 <main>, argc=3, ubp_av=0x7ffe4b15e468, init=<value optimized out>, fini=<value optimized out>, rtld_fini=<value optimized out>, stack_end=0x7ffe4b15e458) at libc-start.c:226
#35 0x00000000004213b9 in _start ()

There are a few things that seem likely as fixes for this:

  1. In ubf_timer_disarm, compare timer_posix.owner with getpid(), as is done in ubf_timer_arm. If the timer is owned by a different process, it's probably a no-op to disarm it and should be reset back to NULL
  2. Install an atfork function ubf_timer_atfork that resets timer_posix to NULL
Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0