Bug #21955
Status: Closed
`Fiber#transfer`: machine stack not released when fiber terminates, causing `FiberError: can't set a guard page`
Description
When a fiber terminates (falls off the end of its block, or raises an unhandled exception) after being reached via Fiber#transfer, its machine stack is not returned to the fiber pool. The stack is only freed when the Fiber object is eventually garbage collected.
In production, where major GC does not run regularly (or at all), terminated fibers accumulate unreleased stacks. Each stack allocation contains a guard page set with mprotect(PROT_NONE), which splits a kernel VMA. On Linux this exhausts the per-process vm.max_map_count limit and raises:
FiberError: can't set a guard page: Cannot allocate memory
The symptom is confusing: the fibers are all dead (alive? == false), but new fibers cannot be created.
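A minimal standalone trigger (a sketch, distinct from the attached repro script): the fiber terminates immediately after being reached via transfer, and is observably dead, yet on affected versions its machine stack remains mapped until the Fiber object is collected.

```ruby
# Minimal trigger: the fiber is dead, but on affected versions its
# machine stack (including the PROT_NONE guard page) stays mapped
# until the Fiber object itself is garbage collected.
f = Fiber.new { :done }  # block falls off the end immediately
f.transfer               # reached via transfer, not resume
puts "dead: #{!f.alive?}"  # true, but the stack may still be held
```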
Root cause
In cont.c, fiber_switch(), the eager stack release after fiber_store returns is guarded by resuming_fiber:
// cont.c (affected versions)
#ifndef COROUTINE_PTHREAD_CONTEXT
    if (resuming_fiber && FIBER_TERMINATED_P(fiber)) {
        RB_VM_LOCKING() {
            fiber_stack_release(fiber);
        }
    }
#endif
resuming_fiber is only non-NULL when the switch was initiated by Fiber#resume (which passes resuming_fiber = fiber). Fiber#transfer passes resuming_fiber = NULL, so the condition is never true and the stack is silently leaked until GC.
Additionally, fiber_raise on a suspended non-yielding (transferred) fiber calls fiber_transfer_kw, also passing resuming_fiber = NULL, so the same leak occurs when a transferred fiber is terminated by a raised exception.
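This exception path can be exercised from Ruby roughly as follows (a sketch: the fiber suspends by transferring back rather than yielding, so Fiber#raise reaches it via the transfer path described above).

```ruby
main = Fiber.current

# The fiber suspends by transferring back to main, so it is suspended
# but non-yielding; Fiber#raise will switch to it via fiber_transfer_kw.
f = Fiber.new do
  main.transfer  # suspend; control returns to main
end

f.transfer  # run f until it transfers back; f is now suspended

begin
  f.raise(RuntimeError, "terminate")  # terminate f with an exception
rescue RuntimeError
  # the unhandled exception propagates back to the switching fiber
end

# f is now terminated; on affected versions its stack is not released
```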
Reproduction
Set a low map count limit, then run the attached script:
sudo bash -c "echo 10000 > /proc/sys/vm/max_map_count"
ruby test_fiber_transfer_leak.rb
Restore the original value afterwards (sysctl vm.max_map_count reports the current setting; the default is typically 65530).
GC.disable

leaked = []
count = 0

begin
  10_000.times do
    f = Fiber.new { } # terminates immediately
    leaked << f       # hold reference so Fiber object is not GC'd
    f.transfer        # transfer, not resume => stack NOT released (bug)
    count += 1
    puts "[#{count} fibers] all dead: #{leaked.none?(&:alive?)}" if count % 1000 == 0
  end
  puts "No error — fix is applied."
rescue FiberError => e
  puts "FiberError after #{count} fibers: #{e.message}"
  puts "All terminated (alive?=false): #{leaked.none?(&:alive?)}"
ensure
  GC.enable
  leaked.clear
  GC.start
end
Expected output on unpatched Ruby (vm.max_map_count=10000):
[1000 fibers] all dead: true
[2000 fibers] all dead: true
[3000 fibers] all dead: true
[4000 fibers] all dead: true
FiberError after 4096 fibers: can't set a guard page: Cannot allocate memory
All terminated (alive?=false): false
Expected output on patched Ruby:
[1000 fibers] all dead: true
...
[10000 fibers] all dead: true
No error — fix is applied.
Fix
Drop the resuming_fiber && guard. After fiber_store returns we are executing in the caller's context — we are never on fiber's stack — so releasing it is unconditionally safe. fiber_stack_release is already idempotent (guarded by fiber->stack.base == NULL), so the resume path is unaffected.
#ifndef COROUTINE_PTHREAD_CONTEXT
    if (FIBER_TERMINATED_P(fiber)) {
        RB_VM_LOCKING() {
            fiber_stack_release(fiber);
        }
    }
#endif
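The effect can also be observed without lowering vm.max_map_count, by watching the process's mapping count directly (a Linux-only sketch; the line count of /proc/self/maps approximates the VMA count, and growth may come in steps because stacks are carved out of shared pool allocations):

```ruby
# Linux-only sketch: each leaked stack adds mappings (the PROT_NONE
# guard page splits a VMA), so /proc/self/maps grows on unpatched Ruby
# and stays roughly flat once the fix is applied.
def vma_count
  File.readlines("/proc/self/maps").size
end

GC.disable
keep = []

before = vma_count
100.times do
  f = Fiber.new { } # terminates immediately
  keep << f         # hold the reference so GC cannot reclaim the stack
  f.transfer
end
after = vma_count

puts "VMAs before: #{before}, after: #{after}"
GC.enable
```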
Pull request with fix: https://github.com/ruby/ruby/pull/16416
Updated by Anonymous 3 days ago
- Status changed from Open to Closed
Applied in changeset git|dc1777d01770ab62ec99ff6fa4cf622098f44968.
Ensure fiber stack is freed in all cases, if the fiber is terminated. (#16416)
[Bug #21955]
Updated by ioquatix (Samuel Williams) 3 days ago
- Description updated (diff)
Updated by rwstauner (Randy Stauner) 2 days ago
backport PR for 4.0: https://github.com/ruby/ruby/pull/16422
Updated by rwstauner (Randy Stauner) 2 days ago
- Backport changed from 3.2: REQUIRED, 3.3: REQUIRED, 3.4: REQUIRED, 4.0: REQUIRED to 3.2: REQUIRED, 3.3: REQUIRED, 3.4: REQUIRED, 4.0: DONE