Bug #21955: `Fiber#transfer`: machine stack not released when fiber terminates, causing `FiberError: can't set a guard page`

Added by ioquatix (Samuel Williams) 3 days ago. Updated 2 days ago.


Description

When a fiber terminates (falls off the end of its block, or raises an unhandled exception) after being reached via Fiber#transfer, its machine stack is not returned to the fiber pool. The stack is only freed when the Fiber object is eventually garbage collected.

In production, where major GC runs rarely (or not at all), terminated fibers accumulate unreleased stacks. Each stack allocation includes a guard page protected with mprotect(PROT_NONE), which splits a kernel VMA, so every retained stack keeps consuming entries in the process's memory map. On Linux this eventually exhausts the per-process vm.max_map_count limit, and new fibers fail with:

```
FiberError: can't set a guard page: Cannot allocate memory
```

The symptom is confusing: the fibers are all dead (alive? == false), but new fibers cannot be created.
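On Linux, one way to watch this happen is to compare the number of mappings the process currently holds against the limit. The sketch below assumes a Linux /proc filesystem. The helper name map_count_status is hypothetical, and the helper returns nil where /proc is unavailable.

```ruby
# Hypothetical diagnostic helper (Linux only): compares the number of memory
# mappings (VMAs) in this process with the vm.max_map_count limit.
def map_count_status
  maps_path  = "/proc/self/maps"
  limit_path = "/proc/sys/vm/max_map_count"
  return nil unless File.readable?(maps_path) && File.readable?(limit_path)

  used  = File.foreach(maps_path).count # one line per mapping
  limit = Integer(File.read(limit_path).strip)
  { used: used, limit: limit, free: limit - used }
end

if (status = map_count_status)
  puts "mappings: #{status[:used]} / #{status[:limit]}"
else
  puts "no /proc/self/maps (not Linux?)"
end
```

If you call it every few thousand fibers inside the reproduction loop, the mapping count would be expected to climb on an unpatched Ruby.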

Root cause

In cont.c, fiber_switch() eagerly releases the stack of the fiber it switched to once fiber_store returns and that fiber has terminated. However, the check is also guarded by resuming_fiber:

```c
// cont.c (affected versions)
#ifndef COROUTINE_PTHREAD_CONTEXT
    if (resuming_fiber && FIBER_TERMINATED_P(fiber)) {
        RB_VM_LOCKING() {
            fiber_stack_release(fiber);
        }
    }
#endif
```

resuming_fiber is only non-NULL when the switch was initiated by Fiber#resume (which passes resuming_fiber = fiber). Fiber#transfer passes resuming_fiber = NULL, so the condition is never true and the stack is silently leaked until GC.
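To make the effect of the guard concrete, here is a toy Ruby model of the post-fiber_store check. It is not Ruby's implementation: ToyStackPool, ToyFiber, after_switch and outstanding_stacks are all hypothetical names. The model counts stacks that are handed out and never returned, under the old and the patched condition.

```ruby
# Toy model of the eager-release check that runs in fiber_switch() after
# fiber_store returns. All names here are hypothetical; this is not cont.c.
class ToyStackPool
  attr_reader :outstanding

  def initialize
    @outstanding = 0 # stacks handed out and not yet returned
  end

  def acquire
    @outstanding += 1
    Object.new # placeholder for a machine stack
  end

  def release
    @outstanding -= 1
  end
end

ToyFiber = Struct.new(:stack, :terminated)

# fixed: false models `resuming_fiber && FIBER_TERMINATED_P(fiber)`;
# fixed: true models the patched `FIBER_TERMINATED_P(fiber)`.
def after_switch(pool, fiber, resuming_fiber, fixed:)
  should_release = fixed ? fiber.terminated : (!resuming_fiber.nil? && fiber.terminated)
  return unless should_release

  pool.release
  fiber.stack = nil
end

# Starts n fibers that run to completion, switching to each one via
# :resume (resuming_fiber = fiber) or :transfer (resuming_fiber = nil).
def outstanding_stacks(kind, fixed:, n: 1_000)
  pool = ToyStackPool.new
  n.times do
    fiber = ToyFiber.new(pool.acquire, true)
    resuming_fiber = kind == :resume ? fiber : nil
    after_switch(pool, fiber, resuming_fiber, fixed: fixed)
  end
  pool.outstanding
end

p outstanding_stacks(:resume,   fixed: false) # => 0
p outstanding_stacks(:transfer, fixed: false) # => 1000 (held until GC)
p outstanding_stacks(:transfer, fixed: true)  # => 0
```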

Additionally, fiber_raise on a suspended non-yielding (transferred) fiber calls fiber_transfer_kw, also passing resuming_fiber = NULL, so the same leak occurs when a transferred fiber is terminated by a raised exception.
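The second path can be reached from plain Ruby, assuming Ruby 3.0 or later. A fiber that suspends itself by transferring back, rather than calling Fiber.yield, is not yielding, so Fiber#raise switches into it with a transfer. In the minimal sketch below, the exception is unhandled inside the fiber, so the fiber terminates and the error surfaces in the root fiber at the raise call. Repeating this in a loop with GC disabled would be expected to run into the same limit on an unpatched Ruby.

```ruby
# Sketch of the Fiber#raise path on a fiber suspended via #transfer.
root   = Fiber.current
caught = nil

fiber = Fiber.new do
  root.transfer # suspend by transferring back, not via Fiber.yield
  :unreachable
end

fiber.transfer # run the fiber until it transfers back to root

begin
  # fiber is not yielding, so #raise enters it with a transfer;
  # the exception is unhandled there, so the fiber terminates.
  fiber.raise(RuntimeError, "stop")
rescue RuntimeError => e
  caught = e.message
end

puts "caught: #{caught}, alive? #{fiber.alive?}"
```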

Reproduction

Set a low map count limit, then run the script below (saved as test_fiber_transfer_leak.rb):

```sh
sudo bash -c "echo 10000 > /proc/sys/vm/max_map_count"
ruby test_fiber_transfer_leak.rb
```

Record the original value first (sysctl vm.max_map_count; the default is typically 65530) and restore it afterwards, e.g. with sudo sysctl -w vm.max_map_count=65530.

```ruby
GC.disable

leaked = []
count  = 0

begin
  10_000.times do
    f = Fiber.new { }  # terminates immediately
    leaked << f        # hold reference so Fiber object is not GC'd
    f.transfer         # transfer, not resume => stack NOT released (bug)
    count += 1
    puts "[#{count} fibers] all dead: #{leaked.none?(&:alive?)}" if count % 1000 == 0
  end
  puts "No error — fix is applied."
rescue FiberError => e
  puts "FiberError after #{count} fibers: #{e.message}"
  puts "All terminated (alive?=false): #{leaked.none?(&:alive?)}"
ensure
  GC.enable
  leaked.clear
  GC.start
end
```
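For comparison, the same loop can be run with Fiber#resume, which takes the path where resuming_fiber is set. According to the root cause above, each terminated fiber's stack is then released eagerly even with GC disabled, so this control run would be expected to finish on affected versions too. The sketch below mirrors the reproduction script; only the switch method and the variable names differ.

```ruby
# Control run: same shape as the reproduction script, but uses #resume.
GC.disable

held  = []
count = 0

begin
  10_000.times do
    f = Fiber.new { } # terminates immediately
    held << f         # keep the Fiber object reachable
    f.resume          # resume path: stack released when the fiber terminates
    count += 1
  end
  puts "resume: #{count} fibers, all dead: #{held.none?(&:alive?)}"
ensure
  GC.enable
  held.clear
  GC.start
end
```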

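Expected output on unpatched Ruby (vm.max_map_count=10000):

```
[1000 fibers] all dead: true
[2000 fibers] all dead: true
[3000 fibers] all dead: true
[4000 fibers] all dead: true
FiberError after 4096 fibers: can't set a guard page: Cannot allocate memory
All terminated (alive?=false): false
```

The final false comes from the fiber whose transfer raised. It was appended to leaked but never started, so it still reports alive? == true. Every fiber that actually ran has terminated.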

Expected output on patched Ruby:

```
[1000 fibers] all dead: true
...
[10000 fibers] all dead: true
No error — fix is applied.
```

Fix

Drop the resuming_fiber && guard. Once fiber_store returns, execution is back in the caller's context and never on the terminated fiber's stack, so releasing that stack is unconditionally safe. fiber_stack_release is already idempotent (guarded by a fiber->stack.base == NULL check), so the resume path is unaffected.

```c
#ifndef COROUTINE_PTHREAD_CONTEXT
    if (FIBER_TERMINATED_P(fiber)) {
        RB_VM_LOCKING() {
            fiber_stack_release(fiber);
        }
    }
#endif
```
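The safety argument rests on the release being a no-op once the stack is gone. The toy Ruby model below illustrates this (ToyPooledFiber and release_stack are hypothetical names, not the cont.c code). Because the release is guarded, calling it on both the resume and transfer paths cannot return the same stack to the pool twice.

```ruby
# Toy model of an idempotent stack release (hypothetical; not cont.c).
class ToyPooledFiber
  attr_reader :stack_base

  def initialize(pool)
    @pool = pool
    @stack_base = pool.pop # take a stack from the pool
  end

  # Mirrors the guard described in the report: once stack_base is nil,
  # further releases do nothing.
  def release_stack
    return if @stack_base.nil?

    @pool.push(@stack_base)
    @stack_base = nil
  end
end

pool  = [:stack_a, :stack_b]
fiber = ToyPooledFiber.new(pool) # pool is now [:stack_a]
fiber.release_stack              # pool is now [:stack_a, :stack_b]
fiber.release_stack              # no-op: already released
p pool                           # => [:stack_a, :stack_b]
```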

Pull request with fix: https://github.com/ruby/ruby/pull/16416

Updated by Anonymous 3 days ago #1

  • Status changed from Open to Closed

Applied in changeset git|dc1777d01770ab62ec99ff6fa4cf622098f44968.


Ensure fiber stack is freed in all cases, if the fiber is terminated. (#16416)

[Bug #21955]

Updated by ioquatix (Samuel Williams) 3 days ago #2

  • Description updated

Updated by rwstauner (Randy Stauner) 2 days ago #4

  • Backport changed from 3.2: REQUIRED, 3.3: REQUIRED, 3.4: REQUIRED, 4.0: REQUIRED to 3.2: REQUIRED, 3.3: REQUIRED, 3.4: REQUIRED, 4.0: DONE