Project

General

Profile

Bug #17263

Fiber context switch degrades with number of fibers, limit on number of fibers

Added by ciconia (Sharon Rosner) about 2 months ago. Updated about 1 month ago.

Status:
Open
Priority:
Normal
Assignee:
-
Target version:
-
[ruby-core:100401]

Description

I'm working on developing Polyphony, a Ruby gem for writing
highly-concurrent Ruby programs with fibers. In the course of my work I have
come up against two problems using Ruby fibers:

  1. Fiber context switching performance seem to degrade as the number of fibers is increased. This is both with Fiber#transfer and Fiber#resume/Fiber.yield.
  2. The number of concurrent fibers that can exist at any time seems to be limited. Once a certain number is reached (on my system this seems to be 31744 fibers), calling Fiber#transfer will raise a FiberError with the message can't set a guard page: Cannot allocate memory. This is not due to RAM being saturated. With 10000 fibers, my test program hovers at around 150MB RSS (on Ruby 2.7.1).

Here's a program for testing the performance of Fiber#transfer:

# frozen_string_literal: true

require 'fiber'

class Fiber
  attr_accessor :next
end

def run(num_fibers)
  count = 0

  GC.start
  GC.disable

  first = nil
  last = nil
  supervisor = Fiber.current
  num_fibers.times do
    fiber = Fiber.new do
      loop do
        count += 1
        if count == 1_000_000
          supervisor.transfer
        else
          Fiber.current.next.transfer
        end
      end
    end
    first ||= fiber
    last.next = fiber if last
    last = fiber
  end

  last.next = first

  t0 = Time.now
  first.transfer
  elapsed = Time.now - t0

  rss = `ps -o rss= -p #{Process.pid}`.to_i

  puts "fibers: #{num_fibers} rss: #{rss} count: #{count} rate: #{count / elapsed}"
rescue Exception => e
  puts "Stopped at #{count} fibers"
  p e
end

run(100)
run(1000)
run(10000)
run(100000)

With Ruby 2.6.5 I'm getting:

fibers: 100 rss: 23212 count: 1000000 rate: 3357675.1688139187
fibers: 1000 rss: 31292 count: 1000000 rate: 2455537.056439736
fibers: 10000 rss: 127388 count: 1000000 rate: 954251.1674325482
Stopped at 22718 fibers
#<FiberError: can't set a guard page: Cannot allocate memory>

With Ruby 2.7.1 I'm getting:

fibers: 100 rss: 23324 count: 1000000 rate: 3443916.967616508
fibers: 1000 rss: 34676 count: 1000000 rate: 2333315.3862491543
fibers: 10000 rss: 151364 count: 1000000 rate: 916772.1008060966
Stopped at 31744 fibers
#<FiberError: can't set a guard page: Cannot allocate memory>

With ruby-head I get an almost identical result to that of 2.7.1.

As you can see, the performance degradation is similar in all the three versions
of Ruby, going from ~3.4M context switches per second for 100 fibers to less
then 1M context switches per second for 10000 fibers. Running with 100000 fibers
fails to complete.

Here's a program for testing the performance of Fiber#resume/Fiber.yield:

# frozen_string_literal: true

require 'fiber'

class Fiber
  attr_accessor :next
end

# This program shows how the performance of Fiber.transfer degrades as the fiber
# count increases

def run(num_fibers)
  count = 0

  GC.start
  GC.disable

  fibers = []
  num_fibers.times do
    fibers << Fiber.new { loop { Fiber.yield } }
  end

  t0 = Time.now

  while count < 1000000
    fibers.each do |f|
      count += 1
      f.resume
    end
  end

  elapsed = Time.now - t0

  puts "fibers: #{num_fibers} count: #{count} rate: #{count / elapsed}"
rescue Exception => e
  puts "Stopped at #{count} fibers"
  p e
end

run(100)
run(1000)
run(10000)
run(100000)

With Ruby 2.7.1 I'm getting the following output:

fibers: 100 count: 1000000 rate: 3048230.049946255
fibers: 1000 count: 1000000 rate: 2362235.6455160403
fibers: 10000 count: 1000000 rate: 950251.7621725246
Stopped at 21745 fibers
#<FiberError: can't set a guard page: Cannot allocate memory>

As I understand it, theoretically at least switching between fibers should have
a constant cost in terms of CPU cycles, irrespective of the number of fibers
currently existing in memory. I am completely ignorant the implementation
details of Ruby fibers, so at least for now I don't have any idea where this
problem is coming from.

Updated by ioquatix (Samuel Williams) about 2 months ago

Regarding "can't set a guard page" it's because of your system is limiting the number of memory mapped segments. Each fiber stack requires a guard page and this is considered a separate memory map entry.

Try increasing it:

sysctl -w vm.max_map_count=1000000

Regarding fiber context switching performance, you are correct it should be constant time no matter the number of fibers. What is not constant time is garbage collection, but it looks like you disabled that. I don't know why you are having such an experience, I need to check for myself what is going on.

Updated by ioquatix (Samuel Williams) about 2 months ago

I can confirm I can reproduce your problem. As an experiment, I did:

    first.transfer
    count = 0

as warmup to see if it was some kind of lazy allocation issue, but it didn't seem to help. I'll have to take a look at the implementation. Thanks for bringing this to my attention.

#3

Updated by Eregon (Benoit Daloze) about 2 months ago

  • Description updated (diff)

Updated by ioquatix (Samuel Williams) about 2 months ago

On my computer, I found the following.

I changed your script to run with the given number of fibers as an argument.

> sudo perf stat -e page-faults,cpu-cycles:u,cpu-cycles:k ./ruby ../test.rb 10000
fibers: 10000 rss: 143252 count: 1000000 rate: 1185760.5787558889

 Performance counter stats for './ruby ../test.rb 10000':

            37,162      page-faults                                                 
     3,134,456,066      cpu-cycles:u                                                
       453,411,132      cpu-cycles:k                                                

       0.961123393 seconds time elapsed

       0.846998000 seconds user
       0.107914000 seconds sys


> sudo perf stat -e page-faults,cpu-cycles:u,cpu-cycles:k ./ruby ../test.rb 100000
fibers: 100000 rss: 1302384 count: 1000000 rate: 670890.0627347956

 Performance counter stats for './ruby ../test.rb 100000':

           346,580      page-faults                                                 
     3,989,630,633      cpu-cycles:u                                                
     4,084,402,671      cpu-cycles:k                                                

       2.151275080 seconds time elapsed

       1.151921000 seconds user
       0.989052000 seconds sys


> sudo perf stat -e page-faults,cpu-cycles:u,cpu-cycles:k ./ruby ../test.rb 1000000
fibers: 1000000 rss: 12886240 count: 1000000 rate: 153311.38970854803

 Performance counter stats for './ruby ../test.rb 1000000':

         3,438,434      page-faults                                                 
     8,948,808,388      cpu-cycles:u                                                
    35,212,243,043      cpu-cycles:k                                                

      11.706676724 seconds time elapsed

       3.073442000 seconds user
       8.496062000 seconds sys

:u means user space and :k means kernel space.

Even thought the cpu-cycles is roughtly the same (and user time), we can see the system time varies by an order of magnitude. I had to run several times to get this clear picture, but I believe the overhead is coming from page-faults as the stacks are swapped. The more stacks you have, the more page faults you have in your L1/L2 cache. I'm not sure if we can verify this further, but one way might be to change the defaults stack size. I'll play around with it a bit more. Certainly, a CPU with a bigger L1/L2 cache should perform better if this theory is true.

Updated by ioquatix (Samuel Williams) about 2 months ago

I've been meaning to revisit the x64 implementation to make it more memory friendly.

I played around with it today in my spare time. Here is updated implementation:

.globl PREFIXED_SYMBOL(SYMBOL_PREFIX,coroutine_transfer)
PREFIXED_SYMBOL(SYMBOL_PREFIX,coroutine_transfer):
    # Make space on the stack for 7 registers:
    subq $48, %rsp

    # Save caller stack pointer
    movq %rsp, (%rdi)

    # Save caller state:
    movq %rbp, 40(%rsp)
    movq %rbx, 32(%rsp)
    movq %r12, 24(%rsp)
    movq %r13, 16(%rsp)
    movq %r14, 8(%rsp)
    movq %r15, (%rsp)

    # Restore callee stack pointer
    movq (%rsi), %rsp

    # Move the return address into rax to start prefetching:
    movq 48(%rsp), %rax

    # Restore callee state
    # Save caller state:
    movq 40(%rsp), %rbp
    movq 32(%rsp), %rbx
    movq 24(%rsp), %r12
    movq 16(%rsp), %r13
    movq 8(%rsp), %r14
    movq (%rsp), %r15

    addq $56, %rsp

    # Put the first argument into the return value
    # movq %rdi, %rax

    # We pop the return address and jump to it
    jmpq %rax
    # ret

It seems to have a minor performance improvement, but still fundamentally exhibiting the same page fault behaviour.

Updated by ciconia (Sharon Rosner) about 1 month ago

Please forgive me if this is a stupid question, but are fiber stacks in Ruby resizable? Can they grow or do they have a fixed size?

Updated by ioquatix (Samuel Williams) about 1 month ago

You do not need to preface your questions like that.

The fiber stack is a fixed size, but it's allocated using virtual memory and it's "released" using madvise(DONT_NEED) which allows the kernel to release those pages. So whether you allocate fibers with a 1MiB stack or a 128MiB stack, the only difference is address space consumed which is almost free and the actual amount of stack used, rounded up to the nearest page. One part to be careful of is to ensure the GC knows the correct extent of the stack, otherwise it would page in the entire stack (mostly zeros).

Also available in: Atom PDF