Bug #21955
Updated by ioquatix (Samuel Williams) about 1 month ago
When a fiber terminates (falls off the end of its block, or raises an unhandled exception) after being reached via `Fiber#transfer`, its machine stack is **not** returned to the fiber pool. The stack is only freed when the `Fiber` object is eventually garbage collected. In production, where major GC does not run regularly (or at all), terminated fibers continuously accumulate unreleased stacks. Each stack allocation contains a guard page set with `mprotect(PROT_NONE)`, which splits a kernel VMA. On Linux this exhausts the per-process `vm.max_map_count` limit and raises:

```
FiberError: can't set a guard page: Cannot allocate memory
```

The symptom is confusing: the fibers are all dead (`alive? == false`), but new fibers cannot be created until GC runs.

## Root cause

In `cont.c`, `fiber_switch()`, the eager stack release after `fiber_store` returns is guarded by `resuming_fiber`:

```c
// cont.c (affected versions)
#ifndef COROUTINE_PTHREAD_CONTEXT
    if (resuming_fiber && FIBER_TERMINATED_P(fiber)) {
        RB_VM_LOCKING() {
            fiber_stack_release(fiber);
        }
    }
#endif
```

`resuming_fiber` is only non-`NULL` when the switch was initiated by `Fiber#resume` (which passes `resuming_fiber = fiber`). `Fiber#transfer` passes `resuming_fiber = NULL`, so the condition is never true and the stack is silently leaked until GC.

Additionally, `fiber_raise` on a suspended non-yielding (transferred) fiber calls `fiber_transfer_kw`, also passing `resuming_fiber = NULL`, so the same leak occurs when a transferred fiber is terminated by a raised exception.

## Reproduction

Set a low map count limit, then run the attached script:

```
sudo bash -c "echo 10000 > /proc/sys/vm/max_map_count"
ruby test_fiber_transfer_leak.rb
```

Restore afterwards (`sysctl vm.max_map_count` shows the default, typically 65530).
```ruby
GC.disable

leaked = []
count = 0

begin
  10_000.times do
    f = Fiber.new { } # terminates immediately
    leaked << f       # hold reference so Fiber object is not GC'd
    f.transfer        # transfer, not resume => stack NOT released (bug)
    count += 1
    puts "[#{count} fibers] all dead: #{leaked.none?(&:alive?)}" if count % 1000 == 0
  end
  puts "No error — fix is applied."
rescue FiberError => e
  puts "FiberError after #{count} fibers: #{e.message}"
  puts "All terminated (alive?=false): #{leaked.none?(&:alive?)}"
ensure
  GC.enable
  leaked.clear
  GC.start
end
```

Expected output on **unpatched** Ruby (`vm.max_map_count=10000`):

```
[1000 fibers] all dead: true
[2000 fibers] all dead: true
[3000 fibers] all dead: true
[4000 fibers] all dead: true
FiberError after 4096 fibers: can't set a guard page: Cannot allocate memory
All terminated (alive?=false): false
```

Expected output on **patched** Ruby:

```
[1000 fibers] all dead: true
...
[10000 fibers] all dead: true
No error — fix is applied.
```

## Fix

Drop the `resuming_fiber &&` guard. After `fiber_store` returns we are executing in the caller's context — we are never on `fiber`'s stack — so releasing it is unconditionally safe. `fiber_stack_release` is already idempotent (guarded by `fiber->stack.base == NULL`), so the resume path is unaffected.

```c
#ifndef COROUTINE_PTHREAD_CONTEXT
    if (FIBER_TERMINATED_P(fiber)) {
        RB_VM_LOCKING() {
            fiber_stack_release(fiber);
        }
    }
#endif
```

Pull request with fix: https://github.com/ruby/ruby/pull/16416
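For context, the two entry paths are indistinguishable from Ruby code once the fiber has terminated, which is why the leak is invisible at the Ruby level; only the C-level `resuming_fiber` bookkeeping differs. A minimal sketch (hypothetical example, assuming Ruby >= 3.1 where `Fiber#transfer` is a core method):

```ruby
# Both entry paths leave the fiber equally dead from Ruby's point of view.
resumed = Fiber.new { :done }     # terminates on first entry
resumed.resume                    # resume path: stack released eagerly

transferred = Fiber.new { :done } # terminates on first entry
transferred.transfer              # transfer path: stack leaked until GC (this bug)

p resumed.alive?     # => false
p transferred.alive? # => false
```

Both `alive?` checks return `false`; on affected versions, only the resumed fiber's machine stack has actually been returned to the pool.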
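Until the fix is deployed, the only way to reclaim the stacks on affected versions is to let the dead `Fiber` objects be collected. A hedged workaround sketch (hypothetical, not from the report):

```ruby
# Workaround sketch for affected Ruby versions: the leaked stacks are freed
# when the Fiber objects themselves are collected, so drop references to
# dead fibers and force a GC before the vm.max_map_count limit is hit.
fibers = []

1_000.times do
  f = Fiber.new { } # terminates on first entry
  fibers << f
  f.transfer
end

fibers.reject!(&:alive?) # drop references to terminated fibers
GC.start                 # collecting the Fiber objects releases their stacks

p fibers.size # => 0
```

This only helps if the application can afford periodic major GCs; the proper fix is the unconditional `fiber_stack_release` above.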