Feature #22138

open

Add `RB_NOGVL_PENDING_INTERRUPT_FAIL` flag for `rb_nogvl`.

Added by ioquatix (Samuel Williams) 5 days ago. Updated 5 days ago.

Status: Open
Target version: -
[ruby-core:125892]

Description

Add a new flag, RB_NOGVL_PENDING_INTERRUPT_FAIL, to rb_nogvl(). When set, rb_nogvl() does not enter the blocking region, and so never calls the supplied function, if the current thread has pending interrupts, including interrupts that are currently masked by Thread.handle_interrupt. In that case it returns 0 (a NULL result, since rb_nogvl() returns void *) with errno == 0, the same as the existing RB_NOGVL_INTR_FAIL skip path.

This gives selector / event-loop extensions a reliable way to detect that a thread has work to do (a pending interrupt) before committing to a potentially unbounded native wait, so they can unwind and let Ruby process the interrupt instead of hanging.

Background and motivation

rb_nogvl() already supports a "fail before blocking" mode via RB_NOGVL_INTR_FAIL (0x1). With that flag, if the VM has been interrupted (RUBY_VM_INTERRUPTED_ANY) before the blocking region is entered, the function is skipped and rb_nogvl() returns 0. This is used by rb_thread_call_without_gvl2().

However, RB_NOGVL_INTR_FAIL only reacts to deliverable interrupts. It does not account for interrupts that the current thread has deliberately deferred via Thread.handle_interrupt. There is an important class of bugs where:

  1. A thread has a pending interrupt (e.g. a Thread#raise, a timeout, a shutdown/termination request).
  2. That interrupt is masked by an enclosing Thread.handle_interrupt(... => :never) or Thread.handle_interrupt(... => :on_blocking) region. This is typical in schedulers, supervisors and connection pools that want to control exactly where interrupts are delivered.
  3. The thread is about to enter a native wait (e.g. kqueue/epoll/select, or some other blocking syscall) with the GVL released.
  4. Because the interrupt is masked, the existing RB_NOGVL_INTR_FAIL check does not trip, the native wait is entered, and — if nothing else wakes it — the wait can block indefinitely. The pending interrupt is never observed.

This is not hypothetical: it was hit in production through async-container and io-event, where a scheduler entering a native selector wait could hang because a pending (masked) interrupt was not noticed before the wait began.

A blocking operation that is skipped because the thread has pending work should be treated the same way as an interrupted wait: return immediately without entering the wait, and let the caller unwind so Ruby can process the interrupt.

Proposal

Add the flag:

#define RB_NOGVL_PENDING_INTERRUPT_FAIL  (0x8)

Semantics when the flag is set:

  • If the current thread has pending interrupts (as reported by the thread's pending-interrupt queue, including interrupts masked by Thread.handle_interrupt), rb_nogvl() does not call func. It returns 0.
  • Otherwise rb_nogvl() behaves as usual: it enters the blocking region, calls func, and preserves func's resulting errno.

The check is performed both as an early pre-check (before any of the blocking-region machinery runs) and again inside the blocking-region setup, to close the window between the pre-check and the point where the GVL is released.

Relationship to existing flags

  • RB_NOGVL_INTR_FAIL (0x1) reacts to deliverable VM interrupts (RUBY_VM_INTERRUPTED_ANY). It does not consider interrupts deferred by Thread.handle_interrupt.
  • RB_NOGVL_PENDING_INTERRUPT_FAIL (0x8) reacts to pending interrupts in the thread's queue, including masked ones. This is the key difference: it lets a caller bail out before a native wait even when the interrupt is not currently deliverable, so the caller can unwind to a point where the interrupt can be handled safely.

The flags are independent and may be combined with each other and with RB_NOGVL_UBF_ASYNC_SAFE / RB_NOGVL_OFFLOAD_SAFE.
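
The difference between the two flags can be sketched as a small decision model. This is a minimal sketch of the semantics described above, not CRuby's implementation: struct model_thread, model_skips_func and the MODEL_ constants are hypothetical names introduced for illustration, and the two booleans merely stand in for RUBY_VM_INTERRUPTED_ANY and the thread's pending-interrupt queue.

```c
#include <stdbool.h>

/* Illustrative copies of the flag values: 0x1 is the existing
 * RB_NOGVL_INTR_FAIL, 0x8 is the value proposed in this issue.
 * They are renamed so the sketch compiles without Ruby headers. */
#define MODEL_INTR_FAIL              0x1
#define MODEL_PENDING_INTERRUPT_FAIL 0x8

/* Hypothetical, simplified view of a thread's interrupt state. */
struct model_thread {
    bool deliverable_interrupt;  /* stands in for RUBY_VM_INTERRUPTED_ANY */
    bool pending_queue_nonempty; /* queued interrupts, masked or not */
};

/* Returns true when, under this model, func would be skipped
 * and the call would return 0 without entering the blocking region. */
static bool model_skips_func(const struct model_thread *th, int flags)
{
    if ((flags & MODEL_INTR_FAIL) && th->deliverable_interrupt) {
        return true;
    }
    if ((flags & MODEL_PENDING_INTERRUPT_FAIL) && th->pending_queue_nonempty) {
        return true;
    }
    return false;
}
```

For a thread whose only interrupt is masked (deliverable_interrupt false, pending_queue_nonempty true), the model skips func only when MODEL_PENDING_INTERRUPT_FAIL is among the flags, which is the case the existing flag misses.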

Example

A common pattern is to have the callback write its result into a struct that is pre-initialised to a sentinel (e.g. -1), then report failure once back under the GVL:

struct Arguments {
    int result;
};

static void *my_func(void *ptr) {
    struct Arguments *arguments = ptr;
    arguments->result = my_syscall();
    return NULL;
}

struct Arguments arguments = {.result = -1};
rb_nogvl(my_func, &arguments, ubf, &arguments, RB_NOGVL_PENDING_INTERRUPT_FAIL);

if (arguments.result == -1) {
    // Either the syscall ran and was interrupted (errno == EINTR), or it was
    // not run at all because the thread had pending interrupts (errno == 0).
    rb_sys_fail("my_syscall");
}

Note the errno caveat for this proposal: on the skip path errno is 0, and rb_sys_fail() must not be called with errno == 0 (current CRuby treats that as an internal error and aborts via rb_bug() instead of raising an exception). A caller that wants to funnel both cases through rb_sys_fail() should normalise it:

if (arguments.result == -1) {
    if (errno == 0) errno = EINTR; // skip path leaves errno == 0
    rb_sys_fail("my_syscall");
}

See "Errno handling" below for what errno == EINTR vs errno == 0 tells you.

Extensions can feature-detect the flag at build time so they keep working on older Rubies:

#ifdef RB_NOGVL_PENDING_INTERRUPT_FAIL
    flags |= RB_NOGVL_PENDING_INTERRUPT_FAIL;
#endif

Errno handling

On the pending-interrupt skip path, rb_nogvl() returns 0 with errno == 0. This matches the existing skip path used by RB_NOGVL_INTR_FAIL: when the function is never called, errno is set to 0, the initial value of rb_nogvl()'s saved errno, rather than to whatever errno held before the call. The flag does not otherwise change rb_nogvl()'s errno behaviour, so the change is purely additive: extensions that do not pass the flag see no difference.

For a caller, when an operation reports "no result" (e.g. result == -1), the two interesting outcomes can be told apart by errno:

  • errno == EINTR — the function did run and its syscall was actually interrupted (the conventional EINTR). Some side effects may have occurred.
  • errno == 0 — the function did not run: it was skipped before entering the blocking region, either by RB_NOGVL_PENDING_INTERRUPT_FAIL (pending, possibly masked, interrupts) or by RB_NOGVL_INTR_FAIL.

So for this proposal, result == -1 && (errno == EINTR || errno == 0) means the operation either was interrupted or was never executed. In both cases the caller should unwind and let Ruby process interrupts.
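
The errno-based distinction above could be captured in a small helper. This is a sketch only, assuming the callback's result field was pre-initialised to the -1 sentinel as in the Example section; enum wait_outcome, classify_wait and should_unwind are hypothetical names and are not part of the Ruby C API.

```c
#include <errno.h>

/* Hypothetical classification of what happened to a no-GVL call. */
enum wait_outcome {
    WAIT_COMPLETED,   /* func ran and stored a result other than -1 */
    WAIT_INTERRUPTED, /* func ran and its syscall failed with EINTR */
    WAIT_SKIPPED,     /* func never ran: skip path, errno == 0 */
    WAIT_FAILED       /* func ran and failed with some other errno */
};

/* result is the sentinel-initialised field written by the callback;
 * saved_errno is errno as read immediately after rb_nogvl() returns. */
static enum wait_outcome classify_wait(int result, int saved_errno)
{
    if (result != -1) return WAIT_COMPLETED;
    if (saved_errno == EINTR) return WAIT_INTERRUPTED;
    if (saved_errno == 0) return WAIT_SKIPPED;
    return WAIT_FAILED;
}

/* Both "interrupted" and "skipped" mean the caller should unwind
 * so that Ruby can process pending interrupts. */
static int should_unwind(enum wait_outcome outcome)
{
    return outcome == WAIT_INTERRUPTED || outcome == WAIT_SKIPPED;
}
```

A caller would read errno once, right after rb_nogvl() returns, and pass that value in. Only WAIT_FAILED is then a candidate for rb_sys_fail() without first normalising errno.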

An errno of 0 on a skipped callback is admittedly not very ergonomic (it forces the if (errno == 0) errno = EINTR; dance above before rb_sys_fail()). Making the skip paths report EINTR directly would remove that wrinkle, but it also erases the EINTR-vs-0 distinction above (you could no longer tell a truly-interrupted syscall from a never-run one). That trade-off is a separate concern and may be proposed independently; this proposal keeps the existing behaviour and stays focused on the new flag.

Implementation

A reference implementation is available:

Downstream user with feature detection and a fallback for older Rubies:

The focused CRuby C-API specs (spec/ruby/optional/capi/thread_spec.rb) cover the new flag: the function is not called when the current thread has masked pending interrupts, and errno is 0.

Updated by ioquatix (Samuel Williams) 5 days ago #1

  • Description updated (diff)
  • Assignee set to ioquatix (Samuel Williams)