Bug #18760


Ractors vs "skynet" microbenchmark

Added by zverok (Victor Shepelev) about 2 months ago. Updated about 2 months ago.

Status:
Open
Priority:
Normal
Assignee:
-
Target version:
-
[ruby-core:108430]

Description

I recently stumbled upon the skynet concurrency microbenchmark and tried to adapt it to Ruby Ractors.
Microbenchmarks of this kind are obviously limited, but they at least allow super-rough estimations of "how we are doing". The benchmark just spawns an actor (or coroutine, or channel, etc., depending on the language), which spawns 10 more, each of which spawns 10 more... until there are a million of them; then each leaf returns its ordinal number, and the parents sum the results up.

Ruby code:

N = 1_000_000

def skynet(num, size, div)
  if size == 1
    Ractor.yield num
  else
    children = div.times.map { |i|
      sub_num = num + i*(size/div)
      Ractor.new(sub_num, size/div, div) { |n, sz, d| skynet(n, sz, d) }
    }
    sum = children.sum(&:take)
    Ractor.yield sum
  end
end

start_time = Time.now
main = Ractor.new { skynet(0, N, 10) }
result = main.take
end_time = Time.now
puts "Result: #{result} in #{(end_time - start_time)}s."

(of course, it would probably be better to use Process.clock_gettime, but that's not the point at the moment)
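For reference, a monotonic-clock version of the timing would look like this (a minimal sketch; the `sleep` is only a placeholder for the benchmark body):

```ruby
# The monotonic clock is not affected by wall-clock adjustments,
# so it is the right choice for measuring elapsed time.
start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
sleep 0.01 # placeholder workload; the skynet call would go here
end_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
puts "Elapsed: #{end_time - start_time}s."
```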

The point is, what I found seems to indicate we still have some room for improvement for Ractors, to say the least.

$ ruby -v
ruby 3.2.0dev (2022-04-28T06:44:02Z master 5250210aa9) [x86_64-linux]

Results on my machine (Thinkpad E300, 8 GB RAM):

  • N=1000: Result: 499500 in 1.614297829s. (BTW, on ruby 3.1.0p0 (2021-12-25 revision fb4df44d16) [x86_64-linux] I also checked, it was much faster: 0.3-0.4s typically)
  • N=10_000: Result: 49995000 in 46.89935975s. (really long?)
  • N=100_000: Aborted (core dumped)

Not sure this is telling us anything useful (and not sure I ported the benchmark properly, TBH)

Updated by Eregon (Benoit Daloze) about 2 months ago

Each Ractor currently uses one pthread behind the scenes, so you're spawning N threads, and that's of course slow.
This benchmark likely works a lot better with Fiber instead of Ractor.

However I heard @ko1 (Koichi Sasada) is working on pooling multiple Ractors per native thread, which would help for this microbenchmark.
(This microbenchmark is rather silly if you ask me, because it's like the worst case overhead for concurrency primitives, not representative of anything real world).

Updated by zverok (Victor Shepelev) about 2 months ago

this microbenchmark (which is rather silly if you ask me, because it's like the worst case overhead for concurrency primitives, not representative of anything real world)

I think that's the point (as with many microbenchmarks): demonstrate edge-case performance for some technology. To my knowledge, C++'s boost::fiber and OpenJDK use this benchmark, so I was interested in what it would show.

To the best of my understanding, using Fibers wouldn't be appropriate, as they don't demonstrate true parallelism? (But I might be wrong; my understanding of concurrency/parallelism is flawed, to say the least.)

Updated by duerst (Martin Dürst) about 2 months ago

Eregon (Benoit Daloze) wrote in #note-1:

However I heard @ko1 (Koichi Sasada) is working on pooling multiple Ractors per native thread, which would help for this microbenchmark (which is rather silly if you ask me, because it's like the worst case overhead for concurrency primitives, not representative of anything real world).

I guess that with silly, you mean the benchmark, not pooling of multiple Ractors.

Updated by Eregon (Benoit Daloze) about 2 months ago

@duerst (Martin Dürst) Yes, of course, I only criticize the benchmark.

@zverok (Victor Shepelev) Note how the README doesn't mention anything based on N native threads?
That's because this benchmark doesn't make sense for that case, and would fail too with N=100000 if you used pthreads directly in C.
Fibers are a concurrency primitive but indeed Fibers of a given Thread don't run in parallel.
(Since Threads don't run in parallel on CRuby, it also means Fibers never run in parallel on CRuby, but on JRuby/TruffleRuby, Fibers of different Threads do run in parallel)
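To illustrate that point, here is a minimal sketch of Fibers on one Thread: they interleave cooperatively at explicit `Fiber.yield`/`#resume` points rather than running in parallel:

```ruby
# Two Fibers on the same Thread: execution only switches when
# control is explicitly handed over via Fiber.yield / #resume.
log = []
a = Fiber.new { log << :a1; Fiber.yield; log << :a2 }
b = Fiber.new { log << :b1; Fiber.yield; log << :b2 }
a.resume; b.resume  # each runs until its first Fiber.yield
a.resume; b.resume  # each runs to completion
p log  # => [:a1, :b1, :a2, :b2]
```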

Some of the "Futures / promises" as well as some of the "Coroutines / fibers / channels" results on the README also likely don't run in parallel.

So basically this benchmark assumes a concurrency primitive that is pooled onto a number of native threads equal to the number of cores, and measuring anything else is comparing apples and oranges.
CRuby doesn't have that currently, but might in the future with @ko1 (Koichi Sasada)'s M:N scheduling proposal.
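For a rough idea of what pooling looks like, here is a hedged sketch that feeds many small jobs to a fixed number of Ractors instead of spawning one Ractor per task (the pool size, the squaring "work", and the round-robin dispatch are all illustrative, not anyone's proposed API):

```ruby
# Sketch: a fixed pool of Ractors processing many jobs.
# Each worker loops: receive a number, yield its square.
POOL_SIZE = 4
jobs = (1..100).to_a

pool = POOL_SIZE.times.map do
  Ractor.new do
    loop do
      n = Ractor.receive
      Ractor.yield n * n # the "work": square the number
    end
  end
end

# Round-robin dispatch, then collect results via Ractor.select.
jobs.each_with_index { |j, i| pool[i % POOL_SIZE].send(j) }
sum = 0
jobs.size.times do
  _worker, value = Ractor.select(*pool)
  sum += value
end
puts sum # sum of squares 1..100 = 338350
```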

Updated by shan (Shannon Skipper) about 2 months ago

@zverok (Victor Shepelev) here's a Ruby example with Fibers that I ported from skynet.cr.

require 'async'
require 'async/queue'

def skynet(c, num, size, div)
  if size == 1
    c << num
  else
    rc = Async::LimitedQueue.new(div)
    sum = 0
    div.times do |i|
      sub_num = num + (i * (size / div))
      Async { skynet(rc, sub_num, size / div, div) }
    end
    div.times do
      sum += rc.dequeue
    end
    c << sum
  end
end

c = Async::LimitedQueue.new(1)
start_time = Time.now
Sync { skynet(c, 0, 1_000_000, 10) }
result = Sync { c.dequeue }
end_time = Time.now
puts "Result: #{result} in #{end_time - start_time}s."