Project

General

Profile

Feature #14739

Improve fiber yield/resume performance

Added by ioquatix (Samuel Williams) 9 months ago. Updated about 1 month ago.

Status:
Closed
Priority:
Normal
Target version:
[ruby-core:86879]

Description

I am interested to improve Fiber yield/resume performance.

I've used this library before: http://software.schmorp.de/pkg/libcoro.html and handled millions of HTTP requests using it.

I'd suggest to use that library.

As this is used in many places in Ruby (e.g. enumerable) it could be a big performance win across the board.

Here is a nice summary of what was done for RethinkDB: https://rethinkdb.com/blog/making-coroutines-fast/

Does Ruby currently reuse stacks? This is also a big performance win if it's not being done already.

History

Updated by shyouhei (Shyouhei Urabe) 9 months ago

ioquatix (Samuel Williams) wrote:

Does Ruby currently reuse stacks?

Yes.

Not sure how fast libcoro is, though.

Updated by ioquatix (Samuel Williams) 9 months ago

Here is the code https://github.com/ioquatix/ruby/tree/fiber-libcoro

UPDATE: I provided some benchmark details, but it turns out they were wrong. I've retracted it until I can provide correct information to prevent any confusion.

Updated by ioquatix (Samuel Williams) 9 months ago

Not sure how fast libcoro is, though.

In my experience, the libcoro ASM implementation is the fastest implementation I found.

It's not much slower than a (normal) C function call.

Updated by ioquatix (Samuel Williams) 9 months ago

  # Without libcoro
  koyoko% ./build/bin/ruby ./fiber_benchmark.rb 10000 1000
  setup time for 10000 fibers:   0.099961
  execution time for 1000 messages:  19.505909

  # With libcoro
  koyoko% ./build/bin/ruby ./fiber_benchmark.rb 10000 1000
  setup time for 10000 fibers:   0.099268
  execution time for 1000 messages:   8.491746

It's about 2.2x faster.

That's about what I was expecting.

Can someone else confirm? Thanks.

Updated by ioquatix (Samuel Williams) 9 months ago

# Without libcoro (macOS)

^_^ > ./build/bin/ruby ./fiber_benchmark.rb 10000 1000
setup time for 10000 fibers:   0.298039
execution time for 1000 messages:  35.248941

# With libcoro (macOS)

^_^ > ./build/bin/ruby ./fiber_benchmark.rb 10000 1000
setup time for 10000 fibers:   0.167117
execution time for 1000 messages:  15.460046

On macOS, it's about the same, 2.2x faster.

Updated by ioquatix (Samuel Williams) 9 months ago

I don't know how to run a full benchmark of Ruby. Can someone help me with that? It would be interesting to get a more general idea of the performance.

Updated by vo.x (Vit Ondruch) 9 months ago

I wonder what architectures libcoro supports? It seems it supports x86 a probably some ARM, but what about s390x and ppc64?

Updated by nobu (Nobuyoshi Nakada) 9 months ago

And seems it requires gcc (variants) and non-Windows.
coro.c can't compile with Visual C nor mingw gcc.
Also, asm needed to be replaced with __asm__ to compile with Apple clang, and it is 3% faster.

$ ruby fiber_benchmark.rb 10000 1000
setup time for 10000 fibers:   0.227721
execution time for 1000 messages:  74.540142

$ ./ruby fiber_benchmark.rb 10000 1000
setup time for 10000 fibers:   0.293740
execution time for 1000 messages:  72.180107

https://github.com/nobu/ruby/tree/feature/fiber-libcoro

Updated by ioquatix (Samuel Williams) 9 months ago

You can see the supported methods here.

https://github.com/ioquatix/ruby/blob/4a9c12d94aae1cf3a52ca5f026432cd03e9817bc/libcoro/coro.h#L99-L164

For the proof of concept, I forced it to use the ASM method, which supports 32-bit and 64-bit x86 CPUs and ARM (I've never tested it).

It would make sense to set up some configure tests to detect which one is available.

I'd also suggest if we move forward with this, we should remove most of the native implementation of coroutines in Ruby because they are slower and clutter up the implementation.

Updated by ioquatix (Samuel Williams) 9 months ago

I've compiled this on both LLVM and GCC just fine.

I've never tried compiling it on Windows but it should work. It might require some work.

Also, asm needed to be replaced with __asm__ to compile with Apple clang

I didn't have this problem. What version of the developer tools are you using?

and it is 3% faster.

If you get that, something is wrong, it's definitely a much bigger improvement than that. Did you try it on Linux?

Updated by ioquatix (Samuel Williams) 9 months ago

I am trying out your branch, and will report back. 3% is within the margin for error so it sounds like nothing changed for some reason. There will be some explanation.

Updated by nobu (Nobuyoshi Nakada) 9 months ago

ioquatix (Samuel Williams) wrote:

Also, asm needed to be replaced with __asm__ to compile with Apple clang

I didn't have this problem. What version of the developer tools are you using?

$ clang --version
Apple LLVM version 8.0.0 (clang-800.0.42.1)
Target: x86_64-apple-darwin15.6.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

If you get that, something is wrong, it's definitely a much bigger improvement than that. Did you try it on Linux?

On Ubuntu 18.04, it has the effect with gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3).

trunk

$ ./x86_64-linux/exe/ruby src/fiber_benchmark.rb 10000 1000
setup time for 10000 fibers:   0.153903
execution time for 1000 messages:  25.395488

fiber-libcoro

$ make -C x86_64-linux prog > /dev/null && ./x86_64-linux/exe/ruby src/fiber_benchmark.rb 10000 1000
In file included from ../src/libcoro/coro.c:41:0,
                 from ../src/cont.c:51:
../src/cont.c: In function ‘cont_free’:
../src/libcoro/coro.h:401:28: warning: statement with no effect [-Wunused-value]
 # define coro_destroy(ctx) (void *)(ctx)
                            ^~~~~~~~~~~~~
../src/cont.c:370:2: note: in expansion of macro ‘coro_destroy’
  coro_destroy((coro_context *)&fib->context);
  ^~~~~~~~~~~~
../src/cont.c: In function ‘fiber_initialize_machine_stack_context’:
../src/cont.c:862:32: warning: passing argument 2 of ‘coro_create’ from incompatible pointer type [-Wincompatible-pointer-types]
     coro_create(&fib->context, rb_fiber_start, NULL, fib->ss_sp, fib->ss_size);
                                ^~~~~~~~~~~~~~
In file included from ../src/cont.c:51:0:
../src/libcoro/coro.c:331:1: note: expected ‘coro_func {aka void (*)(void *)}’ but argument is of type ‘__attribute__((noreturn)) void (*)(void)’
 coro_create (coro_context *ctx, coro_func coro, void *arg, void *sptr, size_t ssize)
 ^~~~~~~~~~~
In file included from ../src/libcoro/coro.c:41:0,
                 from ../src/cont.c:51:
../src/cont.c: In function ‘rb_fiber_terminate’:
../src/libcoro/coro.h:401:28: warning: statement with no effect [-Wunused-value]
 # define coro_destroy(ctx) (void *)(ctx)
                            ^~~~~~~~~~~~~
../src/cont.c:1799:5: note: in expansion of macro ‘coro_destroy’
     coro_destroy(&fib->context);
     ^~~~~~~~~~~~
../src/cont.c: At top level:
cc1: warning: unrecognized command line option ‘-Wno-self-assign’
cc1: warning: unrecognized command line option ‘-Wno-constant-logical-operand’
cc1: warning: unrecognized command line option ‘-Wno-parentheses-equality’
setup time for 10000 fibers:   0.146823
execution time for 1000 messages:   7.855211

Updated by ioquatix (Samuel Williams) 9 months ago

Yes, that supports my own test as well.

koyoko% ruby --version
ruby 2.5.0p0 (2017-12-25 revision 61468) [x86_64-linux]
koyoko% ruby ./fiber_benchmark.rb 10000 1000 
setup time for 10000 fibers:   0.094309
execution time for 1000 messages:  22.248827

koyoko% ./build/bin/ruby --version
ruby 2.6.0dev (2018-05-03 fiber-libcoro 63333) [x86_64-linux]
last_commit=Use libcoro for Fiber implementation to improve performance.
koyoko% ./build/bin/ruby ./fiber_benchmark.rb 10000 1000
setup time for 10000 fibers:   0.104364
execution time for 1000 messages:  19.717851

koyoko% ./build/bin/ruby --version
ruby 2.6.0dev (2018-05-03 fiber-libcoro 63333) [x86_64-linux]
last_commit=Use libcoro for Fiber implementation to improve performance.
koyoko% ./build/bin/ruby ./fiber_benchmark.rb 10000 1000
setup time for 10000 fibers:   0.104798
execution time for 1000 messages:   8.988672

However, on macOS, I can't reproduce my original results. I apologise. I was playing around with stack allocation. I tried to revert back to that state, but couldn't reproduce the results I gave earlier.

I will continue to investigate.

Updated by ioquatix (Samuel Williams) 9 months ago

Okay, I found out what happened.

On macOS, you need to set

#include "libcoro/coro.c"
#define FIBER_USE_NATIVE 1

Otherwise it won't take the optimal code path. My apologies, I think as I was playing with the code I made that change but didn't commit it after I started patching it to work on Linux, since it seems on Linux that's the default.

Here is the performance improvement.

^_^ > ./build/bin/ruby ./fiber_benchmark.rb 10000 1000
setup time for 10000 fibers:   0.165381
execution time for 1000 messages:  14.267517

^_^ > ./build/bin/ruby ./fiber_benchmark.rb 10000 1000
setup time for 10000 fibers:   0.160629
execution time for 1000 messages:   6.307580

So, it's similar speed-up.

I tried to compile without libcoro, but with #define FIBER_USE_NATIVE 1, but it fails because swapcontext/makecontext is deprecated on macOS and compile fails.

Updated by ioquatix (Samuel Williams) 9 months ago

I updated my branch with a few changes.

I'm sorry I didn't rebase on your branch.

I think once we decide if this is a good idea or not, we can decide how best to integrate it with Ruby. I just wanted to make a proof of concept to show it was a good improvement to performance.

My suggestion would be to remove the implementations from cont.c and update libcoro to support all required platforms. The API provided by libcoro is really great and a nice wrapper.

It should be possible to build libcoro on Windows. I do have Windows with Visual Studio set up but I really have no idea how to use it :) However, it wouldn't be silly to update libcoro to make it compile without problems on all supported platforms. It's quite an "old" implementation, but it does work really well. There are some other implementations available too, some are more modern, but I found this one was pretty good.

It might make sense to fork libcoro into a separate repo, I don't mind maintaining it, I already have a fork of it actually, and it's a bit different from the one here. But, it would make sense to update it a bit.

Updated by ioquatix (Samuel Williams) 9 months ago

I was reading https://sourceware.org/ml/libc-help/2016-01/msg00008.html and noticed the following regarding *context functions:

these functions are deprecated/dead -- they no longer exist in the latest
POSIX specification. the preference would be to stop using them. i think
we might consider dropping them in a future glibc version.

Of course they still exist, but yes they are deprecated, and non-existent in the latest POSIX standard. I might even remove it from my fork of libcoro.

Updated by shevegen (Robert A. Heiler) 9 months ago

However, it wouldn't be silly to update libcoro to make it
compile without problems on all supported platforms.

I can't speak for matz and the ruby core team, but in the past
there were (feature-)proposals that were rejected since they
were only specific for e. g. Linux - that is, improvements
pertaining to Linux, but not other OS. I think matz wants to have
ruby be as OS-agnostic as possible; in other words to work on
as many OS as possible, too. And there are quite some people
who use ruby on windows as well, for one reason or another.

As for benchmarks, I think any noticable improvement is a
win and may fit into the "ruby 3 is 3x as fast as ruby 2.0",
but to get to that, it may be more important to verify that
the improvements could also work on windows. Even 3% would
be considerable. :)

By the way, I think there are some ruby-devs who use windows
too ... greg I think. May take a little before the issue here
is seen by them; they could probably help. (I use linux
myself so I won't be of much help.)

Updated by ioquatix (Samuel Williams) 9 months ago

The windows code path for fibers is relatively trivial both in libcoro and cont.c, so I wouldn’t be too concerned about windows support. It shouldn’t be much effort to make it work well in libcoro or keep existing windows code path.

Thanks for your concern and support, and I hope we can get some traction with this improvement.

I use fibers a lot (https://github.com/socketry/async is a fiber [stackful coroutine] based concurrency library). My next step is to benchmark the improvement. It obviously won't be anywhere near 2.2x for real code, but I think it should at least be noticeable.

Updated by shyouhei (Shyouhei Urabe) 9 months ago

I'm neutral. This is a feature request but the "feature" being discussed is the speed of execution. It is by nature different from each other. If this improvement could be truly transparent (and seems currently it is), I think there are chances for acceptance. Wider support for different OSes is definitely nice-to-have of course.

Updated by ioquatix (Samuel Williams) 9 months ago

Thanks for your feedback. When I made this issue, I could only select "Bug", "Feature" or "Misc". Should I have selected "Misc" instead?

Updated by ioquatix (Samuel Williams) 9 months ago

I test in some real world applications today. The first is async, which has a performance test for read context switch overhead: https://github.com/socketry/async/blob/master/spec/async/performance_spec.rb

This isn't direct comparison since I'm using rvm with ruby head and my branch, but it's pretty close.

# Without libcoro fibers
Async::Wrapper
Warming up --------------------------------------
Wrapper#wait_readable
                         1.801k i/100ms
    Reactor#register     2.087k i/100ms
Calculating -------------------------------------
Wrapper#wait_readable
                        176.789k (± 5.7%) i/s -    880.689k in   5.004582s
    Reactor#register    227.882k (± 2.9%) i/s -      1.140M in   5.004740s

Comparison:
    Reactor#register:   227882.2 i/s
Wrapper#wait_readable:   176789.3 i/s - 1.29x  slower

# With libcoro fibers (12% more context switch for read operations)
Async::Wrapper
Warming up --------------------------------------
Wrapper#wait_readable
                         2.217k i/100ms
    Reactor#register     2.380k i/100ms
Calculating -------------------------------------
Wrapper#wait_readable
                        197.116k (± 2.7%) i/s -    986.565k in   5.008582s
    Reactor#register    256.078k (± 4.4%) i/s -      1.278M in   5.003710s

Comparison:
    Reactor#register:   256077.8 i/s
Wrapper#wait_readable:   197115.9 i/s - 1.30x  slower

Updated by ioquatix (Samuel Williams) 9 months ago

Compare async-dns with bind9 for the same workload:

# Without libcoro-fiber
                                     user     system      total        real
Async::DNS::Server               0.000345   0.000029   0.000374 (  0.000381)
Bind9                            0.000294   0.000025   0.000319 (  0.000328)

# With libcoro-fiber (no significant difference)
                                     user     system      total        real
Async::DNS::Server               0.000320   0.000048   0.000368 (  0.000371)
Bind9                            0.000218   0.000033   0.000251 (  0.000258)

This one was a toss-up, I'd say there was no significant difference.

Updated by ioquatix (Samuel Williams) 9 months ago

I tested async-http, a web server, it has a basic performance spec using wrk as the client.

I ran it several times and report the best result of each below. It's difficult to make a judgement. I'd like to say performance was improved but if so, < 5%. However, this benchmark is testing an entire web server stack. Context switching only happens a few times per request.. If I had to take a guess, maybe not more than 4 times (accept, read request, write response). In many cases, we only context switch if the operation would block which is unlikely for small request/response on loopback interface.

# Without libcoro-fiber
Async::HTTP::Server
  simple response
Running 2m test @ http://127.0.0.1:9292/
  8 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   110.06us  647.25us  67.72ms   99.33%
    Req/Sec    12.58k     3.07k   26.94k    70.77%
  12021990 requests in 2.00m, 401.28MB read
Requests/sec: 100100.72
Transfer/sec:      3.34MB

# With libcoro-fiber
Async::HTTP::Server
  simple response
Running 2m test @ http://127.0.0.1:9292/
  8 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   106.47us  834.32us  99.45ms   99.46%
    Req/Sec    12.66k     2.95k   17.61k    71.12%
  12093398 requests in 2.00m, 403.66MB read
Requests/sec: 100694.76
Transfer/sec:      3.36MB

This result surprised me a little bit, but now that I think about it, it could make sense (there is also the possibility I made a mistake or the benchmark is bad). Because the cost of network (read/write) and processing (parsing, generating response, buffers, GC) far outweigh the fiber yield/resume, which is already minimised. In real world situations, the results should lean more in favour of libcoro.

Just for interest, I also collect system call stats.

# Without libcoro
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 45.76    4.635066           2   2095278           sendto
 32.47    3.288691           1   4191323           rt_sigprocmask
 20.90    2.117062           1   2095611       324 recvfrom
  0.67    0.068189        9741         7           poll
  0.07    0.006821           1      6256      5313 openat
  0.03    0.003404           1      4034         5 lstat
  0.01    0.001072           1      1158           read
  0.01    0.001049           1       987           close
  0.01    0.000805           1       901       421 stat
  0.01    0.000627          25        25           clone
  0.01    0.000624           1       793           fstat
  0.01    0.000521           4       124           mmap
  0.00    0.000475           1       798       246 fcntl
  0.00    0.000475           2       297         1 epoll_wait
  0.00    0.000402           3       140           mremap
  0.00    0.000386           1       346       322 epoll_ctl
  0.00    0.000331           1       557       552 ioctl
  0.00    0.000323          16        20           futex
  0.00    0.000321           3        94           mprotect
  0.00    0.000307           1       213           brk
  0.00    0.000255           4        62           getdents
  0.00    0.000183           1       291           getuid
  0.00    0.000180           1       292           geteuid
  0.00    0.000177           1       292           getegid
  0.00    0.000172           1       291           getgid
  0.00    0.000096           3        36           pipe2
  0.00    0.000074           6        12           munmap
  0.00    0.000066          11         6         2 execve
  0.00    0.000052           2        23        14 accept4
  0.00    0.000047           3        18           prctl
  0.00    0.000047           2        27           set_robust_list
  0.00    0.000045           2        19           getpid
  0.00    0.000040           0        81         2 rt_sigaction
  0.00    0.000028           2        16         8 access
  0.00    0.000017           1        15           getcwd
  0.00    0.000016           1        14           readlink
  0.00    0.000016           0       241       238 newfstatat
  0.00    0.000014           0        96           lseek
  0.00    0.000013           1        10           chdir
  0.00    0.000013           3         4           arch_prctl
  0.00    0.000012           0        25           setsockopt
  0.00    0.000009           0        25           getsockname
  0.00    0.000007           2         4           prlimit64
  0.00    0.000006           0        17           getsockopt
  0.00    0.000006           3         2           getrandom
  0.00    0.000004           2         2           sched_getaffinity
  0.00    0.000004           4         1           clock_gettime
  0.00    0.000003           2         2           write
  0.00    0.000003           3         1           sigaltstack
  0.00    0.000003           2         2           set_tid_address
  0.00    0.000002           2         1           vfork
  0.00    0.000001           1         1           wait4
  0.00    0.000001           1         1           getresgid
  0.00    0.000000           0         8           pipe
  0.00    0.000000           0         1           dup2
  0.00    0.000000           0         8           socket
  0.00    0.000000           0         8           bind
  0.00    0.000000           0         8           listen
  0.00    0.000000           0         1           sysinfo
  0.00    0.000000           0         1           getresuid
  0.00    0.000000           0         8           epoll_create1
------ ----------- ----------- --------- --------- ----------------
100.00   10.128563               8400935      7448 total

# With libcoro
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 65.83    5.263501           2   2708883           sendto
 32.87    2.628193           1   2709155       263 recvfrom
  1.06    0.084583       16917         5           poll
  0.09    0.006915           1      6232      5313 openat
  0.06    0.004405           1      4034         5 lstat
  0.02    0.001276           1      1123           read
  0.02    0.001207           1       833       379 stat
  0.01    0.000996           1       963           close
  0.01    0.000510           1       785           fstat
  0.01    0.000492           1       533       528 ioctl
  0.00    0.000330           2       162         1 epoll_wait
  0.00    0.000327           0       797       246 fcntl
  0.00    0.000285          11        25           clone
  0.00    0.000253           1       232           brk
  0.00    0.000253           1       284       260 epoll_ctl
  0.00    0.000239           2       123           mmap
  0.00    0.000207           2        95           mprotect
  0.00    0.000168           8        20           futex
  0.00    0.000163           3        62           getdents
  0.00    0.000142           0       291           getuid
  0.00    0.000139           1       238       235 newfstatat
  0.00    0.000133           0       292           geteuid
  0.00    0.000131           0       291           getgid
  0.00    0.000129           0       292           getegid
  0.00    0.000080           7        12           munmap
  0.00    0.000058           2        32           rt_sigprocmask
  0.00    0.000057           1        88           lseek
  0.00    0.000057           2        36           pipe2
  0.00    0.000044           1        81         2 rt_sigaction
  0.00    0.000043           3        14           readlink
  0.00    0.000039           2        16         8 access
  0.00    0.000036           2        22        13 accept4
  0.00    0.000035           1        27           set_robust_list
  0.00    0.000033           2        18           prctl
  0.00    0.000028           1        19           getpid
  0.00    0.000026           2        15           getcwd
  0.00    0.000020           2        10           chdir
  0.00    0.000013          13         1           wait4
  0.00    0.000009           5         2           getrandom
  0.00    0.000008           0        25           setsockopt
  0.00    0.000006           3         2           write
  0.00    0.000006           0        25           getsockname
  0.00    0.000003           3         1           vfork
  0.00    0.000003           1         6         2 execve
  0.00    0.000003           1         4           arch_prctl
  0.00    0.000003           2         2           set_tid_address
  0.00    0.000003           1         4           prlimit64
  0.00    0.000002           0        17           getsockopt
  0.00    0.000002           2         1           sigaltstack
  0.00    0.000001           1         1           getresuid
  0.00    0.000001           1         1           getresgid
  0.00    0.000001           1         2           sched_getaffinity
  0.00    0.000000           0         8           pipe
  0.00    0.000000           0         1           dup2
  0.00    0.000000           0         8           socket
  0.00    0.000000           0         8           bind
  0.00    0.000000           0         8           listen
  0.00    0.000000           0         1           sysinfo
  0.00    0.000000           0         1           clock_gettime
  0.00    0.000000           0         8           epoll_create1
------ ----------- ----------- --------- --------- ----------------

rt_sigprocmask was gone because it's not invoked by libcoro unless using swapcontext.

Updated by ioquatix (Samuel Williams) 9 months ago

It's been a while since I played around with libcoro.

I was evaluating it's performance in a C++ program.

I found that it's not thread safe due to global variables. I change them to thread local to fix the issue, it works well.

I just want to reinforce that this was a proof of concept, if we decide to roll with such an implementation, it requires more work. I am happy to help with that but it would be good to get some feedback regarding whether such a contribution would be acceptable before investing so much time.

Updated by ko1 (Koichi Sasada) 9 months ago

Sorry I can't read all of your comments because it too long :p

As you quoted first,

Here is a nice summary of what was done for RethinkDB: https://rethinkdb.com/blog/making-coroutines-fast/

In this article:

A lightweight swapcontext implementation

It shows that swapcontext has extra overhead because of sigprocmask system call.

rt_sigprocmask was gone because it's not invoked by libcoro unless using swapcontext.

Yes.

Last year, I tried modified swapcontext that article introduced, and I got good performance.
(I found Fiber resume/yiled ping ping and I found sigprocmask is one overhead, and google about it, and I also found same page :p)

However, introduced swapcontext is based on glibc, so there is a license problem that we can't merge it into Ruby source code.

Using libcoro (I don't see the library, but as you say) seems to use same tech, so it is one idea to employ.
However, I'm not sure it is the best way.

No conclusion, but it is my current comment.

Thanks,
Koichi

Updated by duerst (Martin Dürst) 9 months ago

ioquatix (Samuel Williams) wrote:

Thanks for your feedback. When I made this issue, I could only select "Bug", "Feature" or "Misc". Should I have selected "Misc" instead?

"Feature" should be okay.

Updated by ioquatix (Samuel Williams) 9 months ago

Thanks Koichi, for your valuable response and I appreciate your past work in this area.

I started hacking on my own implementation for x64. It is slightly simpler than libcoro.

I have been reviewing x64 ABI, and it should be pretty trivial to support both 64-bit Windows ABI and 64-bit System V ABI (Linux, Mac, Solaris, BSD). The amount of code is < 200 lines for both ABIs.

For all other ABIs, I suggest using existing code path. I am happy to release this code to Ruby/MRI under whatever license is suitable.

Please be patient while I finish off the patch, when it is done I will update here.

Updated by ioquatix (Samuel Williams) 9 months ago

What compiler is used to compile 64-bit Ruby on Windows?

Updated by ioquatix (Samuel Williams) 9 months ago

Here is the initial code.

https://github.com/kurocha/coroutine

It implements a semantically similar interface to libcoro, but it supports native coroutines on win32, win64 and amd64. I should add a ucontext wrapper (makecontext/swapcontext) for other platforms, then I think all platforms are supported. libcoro didn't have good windows support.

I've put this code under the MIT license.

Updated by sam.saffron (Sam Saffron) 8 months ago

Does this change move us any closer to being able to ship fibers between threads?

Updated by ko1 (Koichi Sasada) 8 months ago

sorry I missed comments.

How to ship with this library? bundle it or download by others?
(this is similar discussion with jemalloc :))

Updated by ioquatix (Samuel Williams) 8 months ago

ko1 (Koichi Sasada) I would suggest we make a Ruby specific version, but we can also try to make generic static library so that it can be maintained separately. I already have some other projects using coroutines so it's useful to me to have a C library implementation which is maintained well.

sam.saffron (Sam Saffron) This is an interesting question which I did specifically try to address in this implementation. I will give you the details.

Typical implementation of Fiber uses thread local variables for main fiber and currently executing fiber Fiber.current. Because of this, it's annoying to ship fiber between threads. Additionally, I'd argue that moving fibers between threads is inherently not safe. I'd Kindly suggest that a coroutine which can be resumed on different threads is not a "Fiber" but a "Green Thread". The fundamental difference is how Fiber is implemented, and it depends on thread local storage. For example, how would Fiber#resume work on a different thread if it's executing already? Right now, yield and resume are VERY efficient because they don't have to check anything like this.

However, coroutines are the underlying abstraction for implementing Fiber and they CAN be moved across threads.

This particular implementation was designed very carefully to allow for this. In particular, coroutine_transfer function takes two arguments, a coroutine to store the current stack, and a coroutine to restore it's stack. In particular, coroutine_transfer passes both these arguments to the start function, and additionally, coroutine_transfer returns the coroutine that invoked it, so returning back doesn't require any shared state. Because of this, the implementation avoids any kind of "global" state, it's all on the coroutine stack.

Therefore, with this coroutine library, we can nicely implement green threads too, but you'd need to provide additional guarantees/locking around coroutine_transfer. If you want to transfer a coroutine to another thread, you need to move the coroutine_context data structure (contains stack) to the new thread, and the new thread needs to call coroutine_transfer. The coroutine can simply call coroutine_transfer to return back, using either the argument from or the result of a previous coroutine_transfer.

So, the short answer is yes.

ko1 (Koichi Sasada) I also finished implementing for arm64, and hopefully can implement for arm32 soon. I test on raspberry pi :) I don't know about PowerPC, I don't have any hardware to test this. Can we test in a VM?

Updated by ioquatix (Samuel Williams) 8 months ago

Here is the test which shows coroutine arguments and coroutine_transfer result.

https://github.com/kurocha/coroutine/blob/9bbd5e514c2e0f8f3c7c1c277aa6deb5e337a9c7/test/Coroutine/transfer.cpp#L17-L21

The reason for COROUTINE macro is that on win32, in order to avoid lots of stack manipulation, we need to use __fastcall.

Updated by ioquatix (Samuel Williams) 8 months ago

I've made a new branch with the new implementation above.

It shows a slightly improved performance improvement over libcoro.

Here is without the PR:

^_^ > ./build/bin/ruby ./fiber_benchmark.rb 10000 1000
setup time for 10000 fibers:   0.161763
execution time for 1000 messages:  14.018874
setup time for 10000 fibers:   1.572869
execution time for 1000 messages:  13.778874
setup time for 10000 fibers:   0.917040
execution time for 1000 messages:  13.942525
setup time for 10000 fibers:   1.616929
execution time for 1000 messages:  13.991115
setup time for 10000 fibers:   1.623587
execution time for 1000 messages:  14.281334

And here it is with the PR, on macOS (the same system used in previous benchmarks):

^_^ > ./build/bin/ruby ./fiber_benchmark.rb 10000 1000 
setup time for 10000 fibers:   0.160637
execution time for 1000 messages:   6.009332
setup time for 10000 fibers:   0.244175
execution time for 1000 messages:   6.246711
setup time for 10000 fibers:   0.242718
execution time for 1000 messages:   6.142166
setup time for 10000 fibers:   0.233410
execution time for 1000 messages:   5.994752
setup time for 10000 fibers:   0.288830
execution time for 1000 messages:   6.216617

Performance is about 2~2.5x faster depending on your analysis. Both creation and execution time is improved. But remember this is micro-benchmark.

I was also interested in mjit performance:

Without PR, enabled mjit:

^_^ > ./build/bin/ruby --jit ./fiber_benchmark.rb 10000 1000
setup time for 10000 fibers:   0.172145
execution time for 1000 messages:  25.176702
setup time for 10000 fibers:   1.654751
execution time for 1000 messages:  14.729177
setup time for 10000 fibers:   1.016810
execution time for 1000 messages:  15.154141
setup time for 10000 fibers:   1.726305
execution time for 1000 messages:  14.797269
setup time for 10000 fibers:   2.025997
execution time for 1000 messages:  15.124753

With PR, enabled mjit:

x_x > ./build/bin/ruby --jit ./fiber_benchmark.rb 10000 1000
setup time for 10000 fibers:   0.179744
execution time for 1000 messages:  13.793318
setup time for 10000 fibers:   0.354717
execution time for 1000 messages:  10.664870
setup time for 10000 fibers:   0.308818
execution time for 1000 messages:   6.956352
setup time for 10000 fibers:   0.378568
execution time for 1000 messages:   6.553922
setup time for 10000 fibers:   0.295583
execution time for 1000 messages:   7.274086

We can see it still needs a bit of work.

I will try to isolate some interesting results from higher level frameworks.

The updated branch is here: https://github.com/ioquatix/ruby/tree/native-fiber

It only work on Darwin x64 at the moment, because changes to autoconf do not cover all platforms yet. I'll fix this soon.

Updated by ioquatix (Samuel Williams) 8 months ago

I fixed autoconf issues and built on Linux. The performance improvement was even more impressive.

koyoko% ruby --version
ruby 2.6.0dev (2018-06-01 native-fiber 63544) [x86_64-linux]
last_commit=Better support for amd64 platforms
koyoko% ruby ./fiber_benchmark.rb 
setup time for 1000 fibers:   0.007222
execution time for 10000 messages:   3.433891
setup time for 1000 fibers:   0.015365
execution time for 10000 messages:   3.177730
setup time for 1000 fibers:   0.010035
execution time for 10000 messages:   3.205329
setup time for 1000 fibers:   0.012063
execution time for 10000 messages:   2.968101
setup time for 1000 fibers:   0.010448
execution time for 10000 messages:   2.947756
koyoko% rvm use 2.6
Using /home/samuel/.rvm/gems/ruby-2.6.0-preview2
koyoko% ruby --version           
ruby 2.6.0preview2 (2018-05-31 trunk 63539) [x86_64-linux]
koyoko% ruby ./fiber_benchmark.rb
setup time for 1000 fibers:   0.006881
execution time for 10000 messages:  13.242779
setup time for 1000 fibers:   0.009869
execution time for 10000 messages:  13.468187
setup time for 1000 fibers:   0.013938
execution time for 10000 messages:  12.691139
setup time for 1000 fibers:   0.014423
execution time for 10000 messages:  12.005481
setup time for 1000 fibers:   0.013953
execution time for 10000 messages:  12.535145

nobu (Nobuyoshi Nakada) do you mind confirming?

Updated by ioquatix (Samuel Williams) 8 months ago

Here is a more realistic benchmark which fiber context switch is only a tiny percentage of the actual run-time.

A brief summary of the benchmark: async-http uses an event-driven stackful coroutine (fiber) based design. Each request allocates a fiber, and each blocking operation (i.e. read) results in Fiber.yield. Once the IO is ready, Fiber#resume is called. So, for each request being processed, we expect several calls to Fiber.yield. async is optimistic so it tries to perform the operation e.g. read and only yields if it results in EWOULDBLOCK so in some cases (especially in synthetic benchmarks) some scheduling may be elided.

koyoko% rvm use 2.6
Using /home/samuel/.rvm/gems/ruby-2.6.0-preview2
koyoko% ruby --version
ruby 2.6.0preview2 (2018-05-31 trunk 63539) [x86_64-linux]
koyoko% bundle exec rake wrk
Running 10s test @ http://127.0.0.1:9294/
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    63.59us   77.52us   4.53ms   98.32%
    Req/Sec    16.68k     1.07k   18.32k    74.26%
  167544 requests in 10.10s, 14.54MB read
Requests/sec:  16589.33
Transfer/sec:      1.44MB
Running 10s test @ http://127.0.0.1:9294/
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    60.85us   34.26us   1.39ms   95.82%
    Req/Sec    16.82k     0.87k   18.49k    70.00%
  167424 requests in 10.00s, 14.53MB read
Requests/sec:  16742.19
Transfer/sec:      1.45MB
Running 10s test @ http://127.0.0.1:9294/
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    62.44us   54.34us   3.81ms   97.62%
    Req/Sec    16.62k     1.00k   18.09k    67.33%
  166959 requests in 10.10s, 14.49MB read
Requests/sec:  16530.76
Transfer/sec:      1.43MB
Running 10s test @ http://127.0.0.1:9294/
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    61.89us   32.53us 687.00us   94.29%
    Req/Sec    16.54k     1.20k   18.37k    67.33%
  166105 requests in 10.10s, 14.42MB read
Requests/sec:  16445.91
Transfer/sec:      1.43MB
Running 10s test @ http://127.0.0.1:9294/
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    60.90us   37.64us   1.70ms   95.89%
    Req/Sec    16.89k     1.22k   18.57k    72.28%
  169694 requests in 10.10s, 14.73MB read
Requests/sec:  16802.33
Transfer/sec:      1.46MB

Here is with the PR:

koyoko% rvm use ruby-head-fiber
Using /home/samuel/.rvm/gems/ruby-head-fiber
koyoko% ruby --version
ruby 2.6.0dev (2018-06-01 native-fiber 63544) [x86_64-linux]
last_commit=Better support for amd64 platforms
koyoko% bundle exec rake wrk
Running 10s test @ http://127.0.0.1:9294/
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    62.53us   73.11us   5.02ms   97.96%
    Req/Sec    16.80k     1.35k   19.46k    63.37%
  168863 requests in 10.10s, 14.65MB read
Requests/sec:  16719.77
Transfer/sec:      1.45MB
Running 10s test @ http://127.0.0.1:9294/
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    58.91us   35.19us   1.54ms   95.25%
    Req/Sec    17.49k     1.16k   19.42k    69.31%
  175719 requests in 10.10s, 15.25MB read
Requests/sec:  17399.00
Transfer/sec:      1.51MB
Running 10s test @ http://127.0.0.1:9294/
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    58.64us   45.92us   3.09ms   96.88%
    Req/Sec    17.72k     1.10k   19.42k    71.29%
  178027 requests in 10.10s, 15.45MB read
Requests/sec:  17626.32
Transfer/sec:      1.53MB
Running 10s test @ http://127.0.0.1:9294/
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    60.83us   33.93us   1.06ms   94.93%
    Req/Sec    16.86k     1.54k   19.36k    63.37%
  169307 requests in 10.10s, 14.69MB read
Requests/sec:  16764.19
Transfer/sec:      1.45MB
Running 10s test @ http://127.0.0.1:9294/
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    59.07us   39.77us   2.17ms   95.97%
    Req/Sec    17.52k     0.98k   19.32k    66.34%
  176112 requests in 10.10s, 15.28MB read
Requests/sec:  17436.64
Transfer/sec:      1.51MB

This is actually better than I expected. I would say there is a practical improvement of about ~5%. In this situation it's very workload dependent, but I'm glad that I saw something.

Updated by cremes (Chuck Remes) 8 months ago

I'd like to link this to another open issue regarding Fiber migration between threads. https://bugs.ruby-lang.org/issues/13821

ioquatix (Samuel Williams), please note in the above-referenced bug that I put in a link to the "boost" documentation regarding coroutine movement between threads. An explicit API to lock/unlock ownership of the fiber to a thread would probably resolve some of the complaints people raise about fiber migration. If it's explicit, more guarantees can be made. Default behavior should be the current behavior where Fibers cannot migrate.

Thanks for your work on this.

Updated by ioquatix (Samuel Williams) 8 months ago

cremes (Chuck Remes) Thanks for your positive feedback and linking me to related issues.

The coroutine implementation was specifically designed to handle cross-thread migrations, in the sense that all the required state to yield/resume is passed as arguments/returns to/from the coroutine.

What this means is that no global/thread-local state is required and thus when moving a coroutine to another thread, there is almost no additional data to sync which is nice from an API point of view.

The bigger challenge is how Ruby Fiber is implemented. It does make it tricky. I would be happy to work towards this. I see the following path being viable:

  • Merge these changes.
  • Simplify the Fiber implementation by removing all the other implementations from cont.c and if necessary move these to the coroutine code (but ideally remove them).
  • With the simplified Fiber code base, explore the overheads of Fiber creation/context switching and figure out the right places to put locking/checks (e.g. for locks being held, etc).

Updated by matz (Yukihiro Matsumoto) 4 months ago

OK, it sounds reasonable. We will give you commit privilege.

Matz.

Updated by hsbt (Hiroshi SHIBATA) 4 months ago

Hi, ioquatix.

I send an invitation of the ruby core team. Please check it.

Updated by ioquatix (Samuel Williams) about 1 month ago

  • Target version set to 2.6
  • Assignee set to ioquatix (Samuel Williams)
  • Status changed from Open to Closed

This is now implemented across: arm32, arm64, ppc64le, win32, win64, x86, amd64. Thanks to everyone who helped with this. This is a really awesome first step to improving Ruby Fiber performance.

Also available in: Atom PDF