Bug #21955

Updated by ioquatix (Samuel Williams) about 1 month ago

When a fiber terminates (falls off the end of its block, or raises an unhandled exception) after being reached via `Fiber#transfer`, its machine stack is **not** returned to the fiber pool. The stack is only freed when the `Fiber` object is eventually garbage collected. 

 In production, where major GC does not run regularly (or at all), terminated fibers accumulate unreleased stacks. Each stack allocation includes a guard page set with `mprotect(PROT_NONE)`, which splits a kernel VMA. On Linux this eventually exhausts the per-process `vm.max_map_count` limit and raises: 

 ``` 
 FiberError: can't set a guard page: Cannot allocate memory 
 ``` 

 The symptom is confusing: the fibers are all dead (`alive? == false`), yet new fibers cannot be created, because GC has not yet run to reclaim the dead fibers' stacks. 

 ## Root cause 

 In `fiber_switch()` in `cont.c`, the eager stack release after `fiber_store` returns is guarded by `resuming_fiber`: 

 ```c 
 // cont.c (affected versions) 
 #ifndef COROUTINE_PTHREAD_CONTEXT 
     if (resuming_fiber && FIBER_TERMINATED_P(fiber)) { 
         RB_VM_LOCKING() { 
             fiber_stack_release(fiber); 
         } 
     } 
 #endif 
 ``` 

 `resuming_fiber` is only non-`NULL` when the switch was initiated by `Fiber#resume` (which passes `resuming_fiber = fiber`). `Fiber#transfer` passes `resuming_fiber = NULL`, so the condition is never true and the stack is silently leaked until GC. 
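The difference is invisible at the Ruby level. A minimal sketch (the comments describe the C-level behavior on affected versions, not anything observable from Ruby):

```ruby
# Both fibers terminate immediately; only the switching API differs.

resumed = Fiber.new { }
resumed.resume            # switch via resume: resuming_fiber != NULL,
                          # so the machine stack is released eagerly

transferred = Fiber.new { }
transferred.transfer      # switch via transfer: resuming_fiber == NULL,
                          # so the stack lingers until GC (the leak)

# Ruby-level state is identical in both cases:
p resumed.alive?      # => false
p transferred.alive?  # => false
```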

 Additionally, `fiber_raise` on a suspended non-yielding (transferred) fiber calls `fiber_transfer_kw`, also passing `resuming_fiber = NULL`, so the same leak occurs when a transferred fiber is terminated by a raised exception. 
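That path can be exercised from Ruby by raising into a fiber that last suspended via `transfer` (a sketch; on affected versions the dead fiber's stack is likewise not released):

```ruby
root = Fiber.current

f = Fiber.new do
  root.transfer  # suspend by transferring back, without yielding
  :unreachable
end
f.transfer       # run until the fiber transfers control back

begin
  f.raise(RuntimeError, "terminate")  # internally fiber_transfer_kw,
                                      # with resuming_fiber = NULL
rescue RuntimeError
  # the unhandled exception propagates back to the root fiber
end

p f.alive?  # => false, yet its machine stack remains mapped (the bug)
```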

 ## Reproduction 

 Set a low map count limit, then run the attached script: 

 ``` 
 sudo bash -c "echo 10000 > /proc/sys/vm/max_map_count" 
 ruby test_fiber_transfer_leak.rb 
 ``` 

 Restore the default afterwards (`sysctl vm.max_map_count` reports the current value; the default is typically 65530). 

 ```ruby 
 GC.disable 

 leaked = [] 
 count  = 0 

 begin 
   10_000.times do 
     f = Fiber.new { }    # terminates immediately 
     leaked << f          # hold reference so Fiber object is not GC'd 
     f.transfer           # transfer, not resume => stack NOT released (bug) 
     count += 1 
     puts "[#{count} fibers] all dead: #{leaked.none?(&:alive?)}" if count % 1000 == 0 
   end 
   puts "No error — fix is applied." 
 rescue FiberError => e 
   puts "FiberError after #{count} fibers: #{e.message}" 
   puts "All terminated (alive?=false): #{leaked.none?(&:alive?)}" 
 ensure 
   GC.enable 
   leaked.clear 
   GC.start 
 end 
 ``` 
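The leak can also be observed directly, without lowering `vm.max_map_count`, by counting memory mappings before and after creating transferred fibers. A Linux-only sketch; `vma_count` is a hypothetical helper reading `/proc/self/maps`:

```ruby
# Hypothetical helper: each line of /proc/self/maps is one VMA.
# Guard pages (mprotect PROT_NONE) split VMAs, so leaked fiber
# stacks show up as growth in this count.
def vma_count
  return 0 unless File.readable?("/proc/self/maps")
  File.foreach("/proc/self/maps").count
end

GC.disable
before = vma_count
fibers = Array.new(100) { Fiber.new { }.tap(&:transfer) }  # all dead, via transfer
after = vma_count
puts "VMA growth after 100 transferred fibers: #{after - before}"
# Unpatched Ruby: growth of roughly 2-3 mappings per fiber.
# Patched Ruby: little or no growth (stacks return to the fiber pool).
GC.enable
```

Note that fiber-pool reuse means the growth is not exactly per-fiber; the point is the monotonic trend on unpatched builds.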

 Expected output on **unpatched** Ruby (`vm.max_map_count=10000`): 

 ``` 
 [1000 fibers] all dead: true 
 [2000 fibers] all dead: true 
 [3000 fibers] all dead: true 
 [4000 fibers] all dead: true 

 FiberError after 4096 fibers: can't set a guard page: Cannot allocate memory 
 All terminated (alive?=false): false 
 ``` 

 Expected output on **patched** Ruby: 

 ``` 
 [1000 fibers] all dead: true 
 ... 
 [10000 fibers] all dead: true 
 No error — fix is applied. 
 ``` 

 ## Fix 

 Drop the `resuming_fiber &&` guard. After `fiber_store` returns we are executing in the caller's context — we are never on `fiber`'s stack — so releasing it is unconditionally safe. `fiber_stack_release` is already idempotent (guarded by `fiber->stack.base == NULL`), so the resume path is unaffected. 

 ```c 
 #ifndef COROUTINE_PTHREAD_CONTEXT 
     if (FIBER_TERMINATED_P(fiber)) { 
         RB_VM_LOCKING() { 
             fiber_stack_release(fiber); 
         } 
     } 
 #endif 
 ``` 

 Pull request with fix: https://github.com/ruby/ruby/pull/16416 
