Bug #17263
Fiber context switch degrades with number of fibers, limit on number of fibers
Description
I'm working on developing Polyphony, a Ruby gem for writing
highly-concurrent Ruby programs with fibers. In the course of my work I have
come up against two problems using Ruby fibers:
- Fiber context switching performance seems to degrade as the number of fibers
  is increased. This is the case both with Fiber#transfer and with
  Fiber#resume/Fiber.yield.
- The number of concurrent fibers that can exist at any time seems to be
  limited. Once a certain number is reached (on my system this seems to be
  31744 fibers), calling Fiber#transfer will raise a FiberError with the
  message "can't set a guard page: Cannot allocate memory". This is not due to
  RAM being saturated: with 10000 fibers, my test program hovers at around
  150MB RSS (on Ruby 2.7.1). A minimal sketch that isolates just this limit
  follows below.
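A minimal sketch of that second problem, separate from the benchmarks below
(the 40_000 figure is arbitrary, just above the limit I observed; each fiber is
resumed once so that its stack and guard page are actually allocated):

require 'fiber'

count = 0
fibers = []
begin
  40_000.times do
    f = Fiber.new { Fiber.yield }
    f.resume           # forces allocation of the fiber's stack and guard page
    fibers << f        # keep a reference so suspended fibers aren't collected
    count += 1
  end
  puts "created #{count} fibers without error"
rescue FiberError => e
  puts "failed after #{count} fibers: #{e.message}"
end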
Here's a program for testing the performance of Fiber#transfer:
# frozen_string_literal: true

require 'fiber'

class Fiber
  attr_accessor :next
end

def run(num_fibers)
  count = 0

  GC.start
  GC.disable

  first = nil
  last = nil
  supervisor = Fiber.current

  num_fibers.times do
    fiber = Fiber.new do
      loop do
        count += 1
        if count == 1_000_000
          supervisor.transfer
        else
          Fiber.current.next.transfer
        end
      end
    end
    first ||= fiber
    last.next = fiber if last
    last = fiber
  end

  last.next = first

  t0 = Time.now
  first.transfer
  elapsed = Time.now - t0

  rss = `ps -o rss= -p #{Process.pid}`.to_i

  puts "fibers: #{num_fibers} rss: #{rss} count: #{count} rate: #{count / elapsed}"
rescue Exception => e
  puts "Stopped at #{count} fibers"
  p e
end

run(100)
run(1000)
run(10000)
run(100000)
With Ruby 2.6.5 I'm getting:
fibers: 100 rss: 23212 count: 1000000 rate: 3357675.1688139187
fibers: 1000 rss: 31292 count: 1000000 rate: 2455537.056439736
fibers: 10000 rss: 127388 count: 1000000 rate: 954251.1674325482
Stopped at 22718 fibers
#<FiberError: can't set a guard page: Cannot allocate memory>
With Ruby 2.7.1 I'm getting:
fibers: 100 rss: 23324 count: 1000000 rate: 3443916.967616508
fibers: 1000 rss: 34676 count: 1000000 rate: 2333315.3862491543
fibers: 10000 rss: 151364 count: 1000000 rate: 916772.1008060966
Stopped at 31744 fibers
#<FiberError: can't set a guard page: Cannot allocate memory>
With ruby-head I get an almost identical result to that of 2.7.1.
As you can see, the performance degradation is similar in all three versions
of Ruby, going from ~3.4M context switches per second with 100 fibers to less
than 1M context switches per second with 10000 fibers. Running with 100000 fibers
fails to complete.
Here's a program for testing the performance of Fiber#resume/Fiber.yield:
# frozen_string_literal: true

require 'fiber'

class Fiber
  attr_accessor :next
end

# This program shows how the performance of Fiber#resume/Fiber.yield degrades
# as the fiber count increases
def run(num_fibers)
  count = 0

  GC.start
  GC.disable

  fibers = []
  num_fibers.times do
    fibers << Fiber.new { loop { Fiber.yield } }
  end

  t0 = Time.now

  while count < 1000000
    fibers.each do |f|
      count += 1
      f.resume
    end
  end

  elapsed = Time.now - t0

  puts "fibers: #{num_fibers} count: #{count} rate: #{count / elapsed}"
rescue Exception => e
  puts "Stopped at #{count} fibers"
  p e
end

run(100)
run(1000)
run(10000)
run(100000)
With Ruby 2.7.1 I'm getting the following output:
fibers: 100 count: 1000000 rate: 3048230.049946255
fibers: 1000 count: 1000000 rate: 2362235.6455160403
fibers: 10000 count: 1000000 rate: 950251.7621725246
Stopped at 21745 fibers
#<FiberError: can't set a guard page: Cannot allocate memory>
As I understand it, theoretically at least, switching between fibers should have
a constant cost in terms of CPU cycles, irrespective of the number of fibers
currently existing in memory. I am completely ignorant of the implementation
details of Ruby fibers, so at least for now I have no idea where this problem
is coming from.
Updated by ioquatix (Samuel Williams) over 2 years ago
Regarding "can't set a guard page" it's because of your system is limiting the number of memory mapped segments. Each fiber stack requires a guard page and this is considered a separate memory map entry.
Try increasing it:
sysctl -w vm.max_map_count=1000000
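To see how close a process is to that limit, here's a small Linux-only sketch that reads /proc:

# Linux-only sketch: compare the system-wide map limit with this process's
# current number of memory map entries.
max_map_count = File.read('/proc/sys/vm/max_map_count').to_i
map_entries   = File.readlines('/proc/self/maps').size

puts "vm.max_map_count:    #{max_map_count}"
puts "current map entries: #{map_entries}"

# Each fiber stack adds at least one extra entry (for its guard page), so the
# remaining headroom gives a rough upper bound on additional fibers.
puts "rough headroom:      #{max_map_count - map_entries}"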
Regarding fiber context switching performance, you are correct: it should be constant time no matter the number of fibers. What is not constant time is garbage collection, but it looks like you disabled that. I don't know why you are seeing this behaviour; I need to check for myself what is going on.
Updated by ioquatix (Samuel Williams) over 2 years ago
I can confirm I can reproduce your problem. As an experiment, I did:
first.transfer
count = 0
as warmup to see if it was some kind of lazy allocation issue, but it didn't seem to help. I'll have to take a look at the implementation. Thanks for bringing this to my attention.
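Presumably that warmup slots into the run method of the script above roughly like this (a sketch, not the exact change):

  last.next = first

  # Warmup: run the whole transfer cycle once so every fiber's stack gets
  # allocated, then reset the counter before the timed run.
  first.transfer
  count = 0

  t0 = Time.now
  first.transfer
  elapsed = Time.now - t0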
Updated by ioquatix (Samuel Williams) over 2 years ago
On my computer, I found the following.
I changed your script to run with the given number of fibers as an argument.
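The exact change isn't shown, but presumably the hard-coded run(...) calls at the bottom of the script become something like:

# Hypothetical: take the fiber count from the first command-line argument,
# falling back to 10000 fibers when none is given.
run((ARGV[0] || 10_000).to_i)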
> sudo perf stat -e page-faults,cpu-cycles:u,cpu-cycles:k ./ruby ../test.rb 10000
fibers: 10000 rss: 143252 count: 1000000 rate: 1185760.5787558889
Performance counter stats for './ruby ../test.rb 10000':
37,162 page-faults
3,134,456,066 cpu-cycles:u
453,411,132 cpu-cycles:k
0.961123393 seconds time elapsed
0.846998000 seconds user
0.107914000 seconds sys
> sudo perf stat -e page-faults,cpu-cycles:u,cpu-cycles:k ./ruby ../test.rb 100000
fibers: 100000 rss: 1302384 count: 1000000 rate: 670890.0627347956
Performance counter stats for './ruby ../test.rb 100000':
346,580 page-faults
3,989,630,633 cpu-cycles:u
4,084,402,671 cpu-cycles:k
2.151275080 seconds time elapsed
1.151921000 seconds user
0.989052000 seconds sys
> sudo perf stat -e page-faults,cpu-cycles:u,cpu-cycles:k ./ruby ../test.rb 1000000
fibers: 1000000 rss: 12886240 count: 1000000 rate: 153311.38970854803
Performance counter stats for './ruby ../test.rb 1000000':
3,438,434 page-faults
8,948,808,388 cpu-cycles:u
35,212,243,043 cpu-cycles:k
11.706676724 seconds time elapsed
3.073442000 seconds user
8.496062000 seconds sys
:u means user space and :k means kernel space.
Even though the cpu-cycles are roughly the same (and user time), we can see the system time varies by an order of magnitude. I had to run this several times to get such a clear picture, but I believe the overhead is coming from page faults as the stacks are swapped. The more stacks you have, the more page faults and L1/L2 cache misses you get. I'm not sure if we can verify this further, but one way might be to change the default stack size. I'll play around with it a bit more. Certainly, a CPU with a bigger L1/L2 cache should perform better if this theory is true.
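One way to try that stack-size experiment (a sketch, assuming the benchmark above is saved as test.rb) is via the RUBY_FIBER_VM_STACK_SIZE and RUBY_FIBER_MACHINE_STACK_SIZE environment variables, which set the sizes reported in RubyVM::DEFAULT_PARAMS:

require 'rbconfig'

# Sketch: re-run the benchmark with different fiber stack sizes and compare
# the reported switch rates. 'test.rb' and the sizes are placeholders.
[128 * 1024, 512 * 1024, 1024 * 1024].each do |size|
  env = {
    'RUBY_FIBER_VM_STACK_SIZE'      => size.to_s,
    'RUBY_FIBER_MACHINE_STACK_SIZE' => size.to_s
  }
  puts "--- fiber stack size: #{size} bytes ---"
  system(env, RbConfig.ruby, 'test.rb', '10000')
end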
Updated by ioquatix (Samuel Williams) over 2 years ago
I've been meaning to revisit the x64 implementation to make it more memory friendly.
I played around with it today in my spare time. Here is the updated implementation:
.globl PREFIXED_SYMBOL(SYMBOL_PREFIX,coroutine_transfer)
PREFIXED_SYMBOL(SYMBOL_PREFIX,coroutine_transfer):
    # Make space on the stack for the 6 callee-saved registers (the return
    # address pushed by the call makes 7 slots in total):
    subq $48, %rsp

    # Save caller stack pointer
    movq %rsp, (%rdi)

    # Save caller state:
    movq %rbp, 40(%rsp)
    movq %rbx, 32(%rsp)
    movq %r12, 24(%rsp)
    movq %r13, 16(%rsp)
    movq %r14, 8(%rsp)
    movq %r15, (%rsp)

    # Restore callee stack pointer
    movq (%rsi), %rsp

    # Move the return address into rax to start prefetching:
    movq 48(%rsp), %rax

    # Restore callee state:
    movq 40(%rsp), %rbp
    movq 32(%rsp), %rbx
    movq 24(%rsp), %r12
    movq 16(%rsp), %r13
    movq 8(%rsp), %r14
    movq (%rsp), %r15

    # Pop the saved registers and the return address:
    addq $56, %rsp

    # Put the first argument into the return value
    # movq %rdi, %rax

    # Jump to the return address we loaded earlier (indirect jump):
    jmpq *%rax
    # ret
It seems to give a minor performance improvement, but it still fundamentally exhibits the same page-fault behaviour.
Updated by ciconia (Sharon Rosner) over 2 years ago
Please forgive me if this is a stupid question, but are fiber stacks in Ruby resizable? Can they grow or do they have a fixed size?
Updated by ioquatix (Samuel Williams) over 2 years ago
You do not need to preface your questions like that.
The fiber stack is a fixed size, but it's allocated using virtual memory and it's "released" using madvise(MADV_DONTNEED), which allows the kernel to release those pages. So whether you allocate fibers with a 1MiB stack or a 128MiB stack, the only differences are the address space consumed, which is almost free, and the actual amount of stack used, rounded up to the nearest page. One part to be careful of is to ensure the GC knows the correct extent of the stack; otherwise it would page in the entire stack (mostly zeros).
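For anyone curious, the fixed sizes the VM was configured with can be inspected via RubyVM::DEFAULT_PARAMS (a quick sketch; key names as in recent CRuby):

# Print the fixed fiber stack sizes this Ruby was configured with.
params = RubyVM::DEFAULT_PARAMS
puts "fiber VM stack size:      #{params[:fiber_vm_stack_size]} bytes"
puts "fiber machine stack size: #{params[:fiber_machine_stack_size]} bytes"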
Updated by rmosolgo (Robert Mosolgo) over 1 year ago
I heard someone ran into this error in a GraphQL-Ruby context, so I thought I'd check out this script on the latest Ruby. It didn't crash as-written, so I added a couple more orders of magnitude. It still finished fine locally, but slowed down in the same way described previously (iiuc). Here's the output of the script, reformatted for readability:
$ ruby -v
ruby 3.1.0p0 (2021-12-25 revision fb4df44d16) [x86_64-darwin19]
$ ruby fibers.rb
fibers: 100 rss: 13788 count: 1000000 rate: 4792967.757705894
fibers: 1000 rss: 25424 count: 1000000 rate: 4186447.6317265746
fibers: 10000 rss: 143384 count: 1000000 rate: 1308239.5543612782
fibers: 100000 rss: 1312544 count: 1000000 rate: 746528.2702790672
fibers: 1000000 rss: 12983392 count: 1000000 rate: 147636.8216863137
fibers: 10000000 rss: 21913812 count: 1000000 rate: 63403.92197640169
Just thought I'd share the behavior on 3.1.0 in case anyone else comes checking on this issue!