Bug #21571

closed

Ruby forked process sporadically hanging on exit

Added by dmorner (Daniel Orner) 1 day ago. Updated 1 day ago.

Status:
Rejected
Assignee:
-
Target version:
-
ruby -v:
ruby 3.4.5 (2025-07-16 revision 20cda200d3) +YJIT +PRISM [x86_64-linux]
[ruby-core:123224]

Description

This is my first bug report, so please let me know if there's anything I can do to improve it.

We have a production-grade Rails app that's been running for many years. We recently moved to EKS and upgraded it to the latest Ruby and Rails. We have a number of delayed_job processes that fork on every job that comes in so that the OS can reclaim the memory used in executing it (we implemented this a long time ago because Ruby never gives up any memory that it takes, and some jobs use way more memory than others).

In the last couple of weeks, we've noticed a rare occurrence where the delayed job hangs when exiting. The code looks like this:

    Process.fork do
      ActiveRecord::Base.establish_connection
      execute_job
    end
    Process.wait

When this bug occurs, the forked child process never exits; it stays stuck forever, doing nothing.

Obviously I don't have a way to reproduce this because it happens maybe once every few thousand jobs, and it happens across all job types.

If I run gdb on the child process, I always see something that looks like this (note: I am a total gdb newbie):

#0  __futex_abstimed_wait_common
    (futex_word=futex_word@entry=0x7fb6af41400c, expected=expected@entry=3, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=, cancel=cancel@entry=false) at ./nptl/futex-internal.c:103
#1  0x00007fb6d5677f68 in __GI___futex_abstimed_wait64
    (futex_word=futex_word@entry=0x7fb6af41400c, expected=expected@entry=3, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=) at ./nptl/futex-internal.c:128
#2  0x00007fb6d568138c in __pthread_rwlock_wrlock_full64 (abstime=0x0, clockid=0, rwlock=0x7fb6af414000) at ./nptl/pthread_rwlock_common.c:730
#3  ___pthread_rwlock_wrlock (rwlock=0x7fb6af414000) at ./nptl/pthread_rwlock_wrlock.c:26
#4  0x00007fb6aee22989 in CRYPTO_THREAD_write_lock () at /lib/x86_64-linux-gnu/libcrypto.so.3
#5  0x00007fb6aee15c6a in  () at /lib/x86_64-linux-gnu/libcrypto.so.3
#6  0x00007fb6aee15fa9 in OPENSSL_thread_stop () at /lib/x86_64-linux-gnu/libcrypto.so.3
#7  0x00007fb6aee153b5 in OPENSSL_cleanup () at /lib/x86_64-linux-gnu/libcrypto.so.3
#8  0x00007fb6d563055d in __run_exit_handlers
    (status=0, listp=0x7fb6d57c5820 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true)
    at ./stdlib/exit.c:116
#9  0x00007fb6d563069a in __GI_exit (status=) at ./stdlib/exit.c:146
#10 0x00007fb6d5ad3a80 in ruby_stop (ex=) at eval.c:290
#11 0x00007fb6d5bc47b4 in rb_f_fork (obj=) at process.c:4388
#12 rb_f_fork (obj=) at process.c:4378
#13 0x00007fb6d5cad5cc in vm_call_cfunc_with_frame_
    (stack_bottom=, argv=, argc=0, calling=, reg_cfp=0x7fb6d4f68280, ec=0x7fb6d4e4d550)
    at /usr/src/ruby/vm_insnhelper.c:3794
#14 vm_call_cfunc_with_frame (ec=0x7fb6d4e4d550, reg_cfp=0x7fb6d4f68280, calling=) at /usr/src/ruby/vm_insnhelper.c:3840
#15 0x00007fb6d5cb3fef in vm_sendish
    (ec=0x7fb6d4e4d550, reg_cfp=0x7fb6d4f68280, cd=0x7fb69fb17650, block_handler=, method_explorer=mexp_search_method)
    at /usr/src/ruby/vm_callinfo.h:415
#16 0x00007fb6d5cc1e59 in vm_exec_core (ec=0x7fb6af41400c, ec@entry=0x7fb6d4e4d550) at /usr/src/ruby/insns.def:851
#17 0x00007fb6d5cc7ba9 in rb_vm_exec (ec=0x7fb6d4e4d550) at vm.c:2595
#18 0x00007fb6b13e73b9 in  ()
#19 0x00007fb6d4f68328 in  ()
...etc, I can paste more if needed

I can't get "call rb_backtrace()" to work in gdb; it never prints anything.

This seems to indicate that there's some kind of thread lock when OpenSSL is shutting down. The crazy thing is that there is only one thread for most of the processes I inspect.

Any help would be greatly appreciated!

Actions #1

Updated by dmorner (Daniel Orner) 1 day ago

  • Subject changed from Ruby sporadically hanging on exit to Ruby forked process sporadically hanging on exit

Updated by byroot (Jean Boussier) 1 day ago

  • Status changed from Open to Rejected

> there is only one thread for most of the processes I inspect.

In the child. But the most likely explanation here is that there are multiple threads in the parent that forks the children. And I highly suspect one of these threads occasionally does some HTTPS requests or makes some other use of OpenSSL.

If you happen to fork at the wrong time, when one of these threads holds a global mutex in OpenSSL, the child might deadlock if it tries to acquire that same mutex. Only the forking thread survives in the child, so the mutex is permanently held by a now-dead thread.

In other words, this isn't a Ruby bug, but an application one.

A few suggestions though:

A quick and dirty workaround is to exit your child with exit!, so that exit handlers aren't run. That should "fix" your issue at hand, but could have other adverse effects.
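
As a rough sketch of that workaround (not the reporter's actual code): the helper name run_in_child is hypothetical, and the block stands in for the ActiveRecord::Base.establish_connection and execute_job calls from the report. Because exit! skips at_exit handlers and does not flush Ruby's IO buffers, the sketch flushes stdout and stderr by hand first.

```ruby
# Hypothetical helper: run a block in a forked child and leave the child
# with exit!, so neither Ruby at_exit handlers nor C-level atexit handlers
# (such as OPENSSL_cleanup in the backtrace above) are run.
def run_in_child
  pid = Process.fork do
    status = 1 # assume failure until the block finishes
    begin
      yield
      status = 0
    rescue StandardError => e
      warn "child failed: #{e.class}: #{e.message}"
    ensure
      # exit! does not flush Ruby-level buffers, so do it explicitly.
      $stdout.flush
      $stderr.flush
      Process.exit!(status)
    end
  end

  # Wait for this specific child and return its Process::Status.
  _pid, result = Process.wait2(pid)
  result
end
```

In the reporter's setup, the call might look like run_in_child { ActiveRecord::Base.establish_connection; execute_job }. Anything that relies on at_exit in the child (log flushing, metrics, temp file cleanup) would have to be done explicitly before the block returns.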

A cleaner fix is to find that background thread in the parent, and synchronize it to ensure it's at a safepoint when you fork your children, or simply to eliminate it.
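
One possible shape for that synchronization, as a minimal sketch: it assumes the background thread's OpenSSL work can be wrapped in a block you control. FORK_LOCK, with_fork_guard, and fork_at_safepoint are hypothetical names, not part of Ruby or delayed_job.

```ruby
# Hypothetical lock shared by the forking code and the background thread.
FORK_LOCK = Mutex.new

# The background thread wraps any work that may take OpenSSL locks
# (for example an HTTPS request) in this guard.
def with_fork_guard
  FORK_LOCK.synchronize { yield }
end

# The parent forks only while holding the same lock, so the fork cannot
# happen while a guarded section is running in another thread.
def fork_at_safepoint(&block)
  pid = FORK_LOCK.synchronize { Process.fork(&block) }
  _pid, status = Process.wait2(pid)
  status
end
```

This only protects code paths that actually go through with_fork_guard. A gem that starts its own thread and talks TLS behind your back would still need to be found and either wrapped or stopped before forking.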

However, note that:

> we implemented this a long time ago because Ruby never gives up any memory that it takes

Isn't true. Ruby will free pages that are fully empty. It is true that fragmentation can sometimes mean more pages than you'd like remain held, but it's not that terrible. Also, this ephemeral forking means the VM never has a chance to warm up, and neither does YJIT. So I'd really suggest reconsidering that choice.

Updated by dmorner (Daniel Orner) 1 day ago

Thanks so much for the prompt response! I learned at least two things in the last 30 seconds I hadn't known before. :) I really appreciate your patience and goodwill. I'll give your suggestions a try!
