Bug #21955
Status: Closed
`Fiber#transfer`: machine stack not released when fiber terminates, causing `FiberError: can't set a guard page`
Description
When a fiber terminates (falls off the end of its block, or raises an unhandled exception) after being reached via Fiber#transfer, its machine stack is not returned to the fiber pool. The stack is only freed when the Fiber object is eventually garbage collected.
In production, where major GC does not run regularly (or at all), terminated fibers accumulate unreleased stacks. Each stack allocation contains a guard page set with mprotect(PROT_NONE), which splits a kernel VMA. On Linux this exhausts the per-process vm.max_map_count limit and raises:
FiberError: can't set a guard page: Cannot allocate memory
The symptom is confusing: the fibers are all dead (alive? == false), but new fibers cannot be created.
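A minimal standalone trigger (a sketch, distinct from the attached repro script): the fiber terminates immediately after being reached via transfer, and is observably dead, yet on affected versions its machine stack remains mapped until the Fiber object is collected.

```ruby
# Minimal trigger: the fiber is dead, but on affected versions its
# machine stack (including the PROT_NONE guard page) stays mapped
# until the Fiber object itself is garbage collected.
f = Fiber.new { :done }  # block falls off the end immediately
f.transfer               # reached via transfer, not resume
puts "dead: #{!f.alive?}"  # true, but the stack may still be held
```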
Root cause
In cont.c, fiber_switch(), the eager stack release after fiber_store returns is guarded by resuming_fiber:
// cont.c (affected versions)
#ifndef COROUTINE_PTHREAD_CONTEXT
    if (resuming_fiber && FIBER_TERMINATED_P(fiber)) {
        RB_VM_LOCKING() {
            fiber_stack_release(fiber);
        }
    }
#endif
resuming_fiber is only non-NULL when the switch was initiated by Fiber#resume (which passes resuming_fiber = fiber). Fiber#transfer passes resuming_fiber = NULL, so the condition is never true and the stack is silently leaked until GC.
Additionally, fiber_raise on a suspended non-yielding (transferred) fiber calls fiber_transfer_kw, also passing resuming_fiber = NULL, so the same leak occurs when a transferred fiber is terminated by a raised exception.
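This exception path can be exercised from Ruby roughly as follows (a sketch: the fiber suspends by transferring back rather than yielding, so Fiber#raise reaches it via the transfer path described above).

```ruby
main = Fiber.current

# The fiber suspends by transferring back to main, so it is suspended
# but non-yielding; Fiber#raise will switch to it via fiber_transfer_kw.
f = Fiber.new do
  main.transfer  # suspend; control returns to main
end

f.transfer  # run f until it transfers back; f is now suspended

begin
  f.raise(RuntimeError, "terminate")  # terminate f with an exception
rescue RuntimeError
  # the unhandled exception propagates back to the switching fiber
end

# f is now terminated; on affected versions its stack is not released
```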
Reproduction
Set a low map count limit, then run the attached script:
sudo bash -c "echo 10000 > /proc/sys/vm/max_map_count"
ruby test_fiber_transfer_leak.rb
Restore the original value afterwards (sysctl vm.max_map_count reports the current setting; the default is typically 65530).
GC.disable

leaked = []
count = 0

begin
  10_000.times do
    f = Fiber.new { } # terminates immediately
    leaked << f       # hold reference so Fiber object is not GC'd
    f.transfer        # transfer, not resume => stack NOT released (bug)
    count += 1
    puts "[#{count} fibers] all dead: #{leaked.none?(&:alive?)}" if count % 1000 == 0
  end
  puts "No error — fix is applied."
rescue FiberError => e
  puts "FiberError after #{count} fibers: #{e.message}"
  puts "All terminated (alive?=false): #{leaked.none?(&:alive?)}"
ensure
  GC.enable
  leaked.clear
  GC.start
end
Expected output on unpatched Ruby (vm.max_map_count=10000):
[1000 fibers] all dead: true
[2000 fibers] all dead: true
[3000 fibers] all dead: true
[4000 fibers] all dead: true
FiberError after 4096 fibers: can't set a guard page: Cannot allocate memory
All terminated (alive?=false): false
Expected output on patched Ruby:
[1000 fibers] all dead: true
...
[10000 fibers] all dead: true
No error — fix is applied.
Fix
Drop the resuming_fiber && guard. After fiber_store returns we are executing in the caller's context — we are never on fiber's stack — so releasing it is unconditionally safe. fiber_stack_release is already idempotent (guarded by fiber->stack.base == NULL), so the resume path is unaffected.
#ifndef COROUTINE_PTHREAD_CONTEXT
    if (FIBER_TERMINATED_P(fiber)) {
        RB_VM_LOCKING() {
            fiber_stack_release(fiber);
        }
    }
#endif
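The effect can also be observed without lowering vm.max_map_count, by watching the process's mapping count directly (a Linux-only sketch; the line count of /proc/self/maps approximates the VMA count, and growth may come in steps because stacks are carved out of shared pool allocations):

```ruby
# Linux-only sketch: each leaked stack adds mappings (the PROT_NONE
# guard page splits a VMA), so /proc/self/maps grows on unpatched Ruby
# and stays roughly flat once the fix is applied.
def vma_count
  File.readlines("/proc/self/maps").size
end

GC.disable
keep = []

before = vma_count
100.times do
  f = Fiber.new { } # terminates immediately
  keep << f         # hold the reference so GC cannot reclaim the stack
  f.transfer
end
after = vma_count

puts "VMAs before: #{before}, after: #{after}"
GC.enable
```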
Pull request with fix: https://github.com/ruby/ruby/pull/16416
Updated by Anonymous 3 days ago
- Status changed from Open to Closed
Applied in changeset git|dc1777d01770ab62ec99ff6fa4cf622098f44968.
Ensure fiber stack is freed in all cases, if the fiber is terminated. (#16416)
[Bug #21955]
Updated by ioquatix (Samuel Williams) 3 days ago
- Description updated (diff)
Updated by rwstauner (Randy Stauner) 2 days ago
backport PR for 4.0: https://github.com/ruby/ruby/pull/16422
Updated by rwstauner (Randy Stauner) 2 days ago
- Backport changed from 3.2: REQUIRED, 3.3: REQUIRED, 3.4: REQUIRED, 4.0: REQUIRED to 3.2: REQUIRED, 3.3: REQUIRED, 3.4: REQUIRED, 4.0: DONE