Bug #17664

Behavior of sockets changed in Ruby 3.0 to non-blocking

Added by ciconia (Sharon Rosner) 7 months ago. Updated 2 months ago.

Status:
Assigned
Priority:
Normal
Target version:
-
[ruby-core:102652]

Description

I'm not sure this is a bug, but apparently a change was introduced in Ruby 3.0 that makes sockets non-blocking by default. This change was apparently introduced as part of the work on the FiberScheduler interface. This change of behaviour is not discussed in the Ruby 3.0.0 release notes.

This change complicates the implementation of an io_uring-based fiber scheduler, since io_uring SQE's on fd's with O_NONBLOCK can return EAGAIN just like normal syscalls. Using io_uring with non-blocking fd's defeats the whole purpose of using io_uring in the first place.

A workaround I have put in place in the Polyphony io_uring backend is to make sure O_NONBLOCK is not set before attempting I/O operations on any fd.
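That workaround can be sketched in plain Ruby using fcntl (the helper name `ensure_blocking` is hypothetical, not Polyphony's actual API):

```ruby
require "fcntl"

# Hedged sketch of the workaround described above: before submitting an
# io_uring operation on a descriptor, clear O_NONBLOCK so the SQE cannot
# complete early with EAGAIN.
def ensure_blocking(io)
  flags = io.fcntl(Fcntl::F_GETFL)
  if flags & Fcntl::O_NONBLOCK != 0
    io.fcntl(Fcntl::F_SETFL, flags & ~Fcntl::O_NONBLOCK)
  end
end

r, w = IO.pipe
# Simulate the Ruby 3.0 behaviour: the descriptor starts out non-blocking.
r.fcntl(Fcntl::F_SETFL, r.fcntl(Fcntl::F_GETFL) | Fcntl::O_NONBLOCK)
ensure_blocking(r)
```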

Updated by xtkoba (Tee KOBAYASHI) 7 months ago

Is this issue related to #17607 and/or #15356 ?

Updated by jeremyevans0 (Jeremy Evans) 7 months ago

  • Assignee set to ioquatix (Samuel Williams)
  • Status changed from Open to Assigned

I believe this is expected, even if not mentioned in the release notes. I think the Ruby-level API remains the same, but passing the underlying file descriptors to C functions can see changed behavior. Assigning to ioquatix (Samuel Williams) to confirm this is expected.

Updated by ioquatix (Samuel Williams) 3 months ago

This change was originally proposed and implemented by normalperson (Eric Wong).

The outward interface does not change, but you are right that it can impact an io_uring implementation.

I know this is a problem and ran into the same issue.

I don't know how we should solve the io_uring issue correctly. There are several options:

  • Change Ruby default IO back to blocking. But it can cause issues for fiber scheduler since non-blocking hooks will never be invoked.
  • In the uring backend, for read/write operations, set the IO to blocking and then revert it afterwards.
  • Propose that uring takes a per-SQE flag/option for this, so that even if the file descriptor is in non-blocking mode, e.g. OP_READV would always behave like a blocking operation.

Personally, I like the last option best, since I think it's more predictable. We cannot assume the state of the FD just from Ruby's default: if a user makes an IO blocking or non-blocking, we'd prefer that uring read/write behaves predictably either way. There is precedent for this with sendmsg/recvmsg too.

I welcome discussion on this point, but for certain, I believe Ruby being non-blocking by default makes sense and that approach was proposed by Eric, and I agreed with it and finally enabled it in Ruby 3. Since there is no outwardly visible change to behaviour, I didn't think it's a big problem, but I also noticed that Eric forced Unicorn IO back to blocking by default, so it might be nice to have their input on the matter.

Updated by normalperson (Eric Wong) 3 months ago

samuel@oriontransfer.net wrote:

I welcome discussion on this point, but for certain, I believe
Ruby being non-blocking by default makes sense and that
approach was proposed by Eric, and I agreed with it and
finally enabled it in Ruby 3. Since there is no outwardly
visible change to behaviour, I didn't think it's a big
problem, but I also noticed that Eric forced Unicorn IO back
to blocking by default, so it might be nice to have his input
too.

Yes, I proposed non-blocking originally; but gave up the
proposal because of potential incompatibilities (e.g. this one).
I've mostly given up on Ruby (and coding in general),
so the change to non-blocking happened anyways...

Anyways, unicorn doesn't benefit at all from non-blocking socket
I/O since it only handles one fast client at-a-time. In
unicorn, blocking I/O results in fewer syscalls since there's no
intermediate calls to poll/ppoll/select/etc. Non-blocking I/O
only makes sense for slow clients (and unicorn could never and
will never be capable of handling slow clients).

Anyways, I haven't familiarized myself with io_uring, yet; but
maybe I'll get around to it (just not for Ruby :P) if I still have
electricity in a few months time...

Updated by ioquatix (Samuel Williams) 3 months ago

Thanks normalperson (Eric Wong) - I understand you aren't interested much in Ruby but wish you the best. Thanks for chiming in promptly.

I probably wouldn't characterise this as an incompatibility, because this problem can surface whenever the user explicitly opts into non-blocking IO. So that's a valid state for the file descriptor, and we need to handle it.

I believe we should involve the authors of io_uring in this discussion because it seems to me we need a more general solution. In general, a file descriptor's non-blocking state is something we cannot anticipate. https://github.com/axboe/liburing/issues/364

Updated by ciconia (Sharon Rosner) 3 months ago

In the uring backend, for read/write operations, set the IO to blocking and then revert it afterwards.

Why would you need to revert it? In practically all cases I can think of, you're going to do all I/O for a given fd on the same scheduler. In addition, if you need to make two additional fcntl system calls on every I/O operation, it defeats the whole purpose of using io_uring in the first place.

I had another solution in mind, similar to what I mentioned above, but more general:

  • Cache the blocking/non-blocking state in an instance variable on the IO/Socket instance.
  • If the instance variable is not set, call fcntl and set the instance variable.
  • This could be done in order to implement both blocking and non-blocking behavior, according to the type of fiber scheduler. So a libev-based scheduler would be able to set it to non-blocking, and an io_uring-based one would set it to blocking.

Pseudo-code:

def check_blocking_state(io, block)
  state = io.instance_variable_get(:@blocking_state)
  if block != state
    flags = fcntl(io.fd, F_GETFL)
    block ? (flags &= ~O_NONBLOCK) : (flags |= O_NONBLOCK)
    fcntl(io.fd, F_SETFL, flags)
    io.instance_variable_set(:@blocking_state, block)
  end
end

This solution, called before any I/O operation, keeps the extra system calls to a minimum and lets you implement schedulers for both blocking and non-blocking I/O. It's also fully backwards compatible with the core Ruby IO and Socket implementations.

(BTW the current IO implementation, when no fiber scheduler is used, calls fcntl at least once for basically every I/O operation, which is also a waste in practically all cases.)

Change Ruby default IO back to blocking. But it can cause issues for fiber scheduler since non-blocking hooks will never be invoked.

I think sockets should be changed back to blocking by default, even if just for the sake of consistency. This change took me by surprise, and it cost me a few hours of looking around trying to figure out why I was getting EAGAIN on sockets and not on files.

Updated by Eregon (Benoit Daloze) 3 months ago

One potential issue with caching the non-blocking state in Ruby is that a C extension might call fcntl() directly on the fd of an IO to change its non-blocking state.
Not sure if that's done in practice, but it could be an issue.

Updated by ioquatix (Samuel Williams) 2 months ago

I have researched this topic today and I'm going to share some of my notes and thoughts.

Firstly, with regard to performance, the most important platform is Linux, and I personally believe that io_uring is going to be the most important interface. We can also support epoll as a fallback, but it's less complete. Of the other interfaces, kqueue is similar to epoll and less interesting.

An important point to consider is that on Linux, I've been told that sockets don't support asynchronous read and/or write. Internally they are emulated by the same strategy a user-space implementation would use - try reading, and on EAGAIN fall back to polling.

In my testing, io_uring io_read and io_write operations perform about 20% worse in my benchmarks. This was surprising to me. My current understanding of why it's slow is that when we perform io_read, internally it performs read, but because we have to defer that operation until the next iteration of the run loop, we pay quite a bit of latency cost here.

The fast path is this:

result = read(fd, ...)
if (EAGAIN) {
  wait_readable
  Fiber.yield
}

// In scheduler:
select -> fiber.resume(result)

The slow path is this:

io_read(fd, ...) -> OP_READ SQE
io_uring_submit() // optional but improve throughput by 5% if done here
Fiber.yield // Unconditional.

// In scheduler:
wait_cqe -> fiber.resume result

Now, there are actually two interpretations of the above; essentially it depends on the percentage of read operations you expect to result in EAGAIN. If you expect that percentage to be low, the single system call for read is far more efficient for Ruby, since we avoid the context switch. In both cases you need a system call, either read or io_uring_submit. With io_uring_submit you can amortise the cost of the system calls, but from what I can tell that saving is much less than the cost of the context switch in Ruby. The interesting point is: when the IO is not so busy, and we expect a higher chance of EAGAIN, the overhead of the yield matters far less.
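At the Ruby level, the optimistic fast path can be sketched with read_nonblock. This is a simplified sketch, not io.c's actual implementation; `wait_readable` stands in for the scheduler's io_wait hook:

```ruby
require "io/wait"

# Optimistic read: one read(2) in the common case; we only pay for
# waiting (and, under a fiber scheduler, yielding to the run loop)
# when the kernel reports EAGAIN via :wait_readable.
def fast_path_read(io, length)
  loop do
    result = io.read_nonblock(length, exception: false)
    return result unless result == :wait_readable
    io.wait_readable # under a fiber scheduler, this yields the fiber
  end
end

r, w = IO.pipe
w.write("hello")
w.close
```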

In my benchmarks using 64+ connections, I only observed OP_READ -> EAGAIN < 20 times out of 1 million requests. So it definitely wasn't a common code path on a server with a lot of active IO. In the case that you DO get EAGAIN, realistically it seems like you will have to wait for a while anyway, so the extra cost of punting the operation through to io_wait seems negligible in practice.

So the net result, from what I can measure so far, is that non-blocking sockets are the most efficient way to handle IO. Forcing sockets to go through OP_READ seems to yield worse performance in every configuration I could think of. I'm going to continue investigating, as I'm a little unconvinced by the results, but based on what I'm seeing, non-blocking sockets (O_NONBLOCK) seem significantly more efficient. If you can produce benchmarks which show something other than what I've found so far, I'd be most interested, and I think it would help make the case that O_NONBLOCK by default was the wrong choice.

As an aside, I did try to make stdin, stdout and stderr nonblocking.

It turns out it's pretty difficult as a ton of things start breaking in unexpected ways - e.g. printf.

Fortunately, I think there is a good solution - we do have the ability to check if an IO is in blocking mode, and if that's the case, we can punt it off to OP_READ, which, while a little slower, will do the right thing without needing O_NONBLOCK. This allows us to have non-blocking stdin and stdout, which would be really great.
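Checking the blocking state is a single F_GETFL. A sketch of the test a scheduler could use to decide which path a descriptor takes (`blocking_fd?` is a hypothetical helper name):

```ruby
require "fcntl"

# True if O_NONBLOCK is not set on the underlying descriptor. A
# scheduler could use this to keep stdin/stdout on the (slightly
# slower) OP_READ path instead of forcing O_NONBLOCK onto them.
def blocking_fd?(io)
  (io.fcntl(Fcntl::F_GETFL) & Fcntl::O_NONBLOCK).zero?
end

r, w = IO.pipe
```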

I'm still working out the details of how this should fit together within io.c but largely I'm convinced that:

  • Non-blocking Socket is the fast path.
  • Blocking file descriptors can still be asynchronous. In io_uring we can use OP_READ and in epoll/kqueue we can use fcntl to toggle O_NONBLOCK. I don't care much about performance impact in epoll and kqueue cases since it's not what I'm considering a hot path.

This change complicates the implementation of an io_uring-based fiber scheduler, since io_uring SQE's on fd's with O_NONBLOCK can return EAGAIN just like normal syscalls. Using io_uring with non-blocking fd's defeats the whole purpose of using io_uring in the first place.

I believe the correct implementation here is like this: https://github.com/socketry/event/blob/8c4449ebe0a3c76681655cf175d5aa6589934a9c/ext/event/backend/uring.c#L291-L306

If you get EAGAIN, you should punt the request off to your io_wait logic.

We are also discussing whether we can get io_uring to implement this automatically: https://github.com/axboe/liburing/issues/364 but I'm now not sure if this is really the right approach.

One final point to consider is buffer management. If we have thousands of sockets, OP_READ-based implementations need one buffer per operation at least. But io_wait style implementation can avoid the need for a buffer until the operation can proceed. io_uring has some solutions for buffer assignment, but it may not be that easy to take advantage of unless we adopt something like the IO::Buffer proposal and internally allocate a pool of them for non-blocking IO.

Updated by ioquatix (Samuel Williams) 2 months ago

Here are some summaries from strace -c:

Using non-blocking sockets (note the errors column which indicates EAGAIN):

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 28.10    0.621514           5    119770           close
 25.50    0.564066           4    119530           write
 18.39    0.406707           3    119532         2 accept4
 14.50    0.320830           2    119844         1 read
 13.39    0.296214           2    119922         9 newfstatat
  0.04    0.000961           1       494       260 openat
  0.02    0.000361           1       186           mmap
  0.02    0.000333           0       551       551 readlink
  0.01    0.000227           2        87           munmap
  0.01    0.000138           2        54           brk
  0.01    0.000116           4        26           getdents64
  0.00    0.000097           1        80           fcntl
  0.00    0.000062           6         9         1 io_uring_enter
  0.00    0.000059           0        63           mprotect
  0.00    0.000036           0       131       128 ioctl
  0.00    0.000030          10         3           getsockname
  0.00    0.000029           0        53           geteuid
  0.00    0.000028           5         5         2 connect
  0.00    0.000027          27         1           io_uring_setup
  0.00    0.000026           0        53           getegid
  0.00    0.000025           0        52           getuid
  0.00    0.000025           0        52           getgid
  0.00    0.000020           2         7           socket
  0.00    0.000014           2         7         4 prctl
  0.00    0.000008           1         5           futex
  0.00    0.000007           1         7           rt_sigprocmask
  0.00    0.000004           1         3           getpid
  0.00    0.000004           2         2           bind
  0.00    0.000003           1         2           sendto
  0.00    0.000003           1         3           recvmsg
  0.00    0.000003           3         1           listen
  0.00    0.000002           0        73           lseek
  0.00    0.000002           1         2         1 recvfrom
  0.00    0.000002           2         1           setsockopt
  0.00    0.000001           1         1           ppoll
  0.00    0.000000           0        17           rt_sigaction
  0.00    0.000000           0         6           pread64
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           getcwd
  0.00    0.000000           0         1           sysinfo
  0.00    0.000000           0         1           sigaltstack
  0.00    0.000000           0         2         1 arch_prctl
  0.00    0.000000           0         1           gettid
  0.00    0.000000           0         1           sched_getaffinity
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           timer_create
  0.00    0.000000           0         1           clock_gettime
  0.00    0.000000           0         1           set_robust_list
  0.00    0.000000           0         2           eventfd2
  0.00    0.000000           0         3           prlimit64
  0.00    0.000000           0         1           getrandom
------ ----------- ----------- --------- --------- ------------------
100.00    2.211984           3    600654       961 total

Using blocking sockets:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 23.74    0.504460           5     92066         1 io_uring_enter
 22.62    0.480659           5     92300           close
 16.98    0.360809           1    184200           fcntl
 14.83    0.315110           3     92062         2 accept4
 11.16    0.237053           2     92373           read
 10.55    0.224161           2     92452         9 newfstatat
  0.03    0.000714           1       494       260 openat
  0.02    0.000484           0       551       551 readlink
  0.01    0.000274           4        62           brk
  0.01    0.000262           3        87           munmap
  0.01    0.000255           1       186           mmap
  0.01    0.000210           1       130       127 ioctl
  0.01    0.000123           1        63           mprotect
  0.00    0.000082           1        73           lseek
  0.00    0.000036           2        17           rt_sigaction
  0.00    0.000030           0        53           geteuid
  0.00    0.000027           0        52           getuid
  0.00    0.000026           0        52           getgid
  0.00    0.000026           0        53           getegid
  0.00    0.000019           0        26           getdents64
  0.00    0.000009           4         2           eventfd2
  0.00    0.000005           0         7           rt_sigprocmask
  0.00    0.000005           5         1           getrandom
  0.00    0.000004           4         1           sysinfo
  0.00    0.000004           4         1           timer_create
  0.00    0.000003           1         3           getpid
  0.00    0.000003           3         1           sigaltstack
  0.00    0.000003           0         5           futex
  0.00    0.000000           0         6           pread64
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         7           socket
  0.00    0.000000           0         5         2 connect
  0.00    0.000000           0         2           sendto
  0.00    0.000000           0         2         1 recvfrom
  0.00    0.000000           0         3           recvmsg
  0.00    0.000000           0         2           bind
  0.00    0.000000           0         1           listen
  0.00    0.000000           0         3           getsockname
  0.00    0.000000           0         1           setsockopt
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           getcwd
  0.00    0.000000           0         7         4 prctl
  0.00    0.000000           0         2         1 arch_prctl
  0.00    0.000000           0         1           gettid
  0.00    0.000000           0         1           sched_getaffinity
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           clock_gettime
  0.00    0.000000           0         1           ppoll
  0.00    0.000000           0         1           set_robust_list
  0.00    0.000000           0         3           prlimit64
  0.00    0.000000           0         1           io_uring_setup
------ ----------- ----------- --------- --------- ------------------
100.00    2.124856           3    647427       959 total

Based on this, maybe my implementation of read is not working correctly. I'll have to check it, but generally, you can see the big difference.

Updated by ciconia (Sharon Rosner) 2 months ago

In my testing, io_uring io_read and io_write operations perform about 20% worse in my benchmarks. This was surprising to me. My current understanding of why it's slow is that when we perform io_read, internally it performs read, but because we have to defer that operation until the next iteration of the run loop, we pay quite a bit of latency cost here.

I think the increased latency is to be expected. Did you measure throughput? In my own benchmarks (on Polyphony) I've seen better throughput, slightly worse latency (don't remember the numbers though).

In my benchmarks using 64+ connections, I only observed OP_READ -> EAGAIN < 20 times out of 1 million requests.

My understanding is sockets are buffered, so normally you'll see EAGAIN only if you saturate them.

Using blocking sockets:

Looking at those numbers a few things stand out (I'm assuming your benchmark is with some kind of HTTP server):

  • There's no line for write (is the blocking version working correctly?)
  • fcntl is at 17% - this is a serious cost if you need to do this on every read/write
  • io_uring_enter is called once for each accept, so apparently there's no batching of SQEs.

Skimming your io_uring backend code I see you're iterating over available CQEs and resume the fiber for each CQE while iterating. In my experience you'll get better numbers if you put those resumable fibers in an array instead, then resume them one by one after having exhausted all available CQEs.
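The restructuring suggested here - drain the completion queue first, resume afterwards - can be sketched independently of io_uring. In this hypothetical sketch, `completions` stands in for the CQ, and each entry pairs a waiting fiber with its operation result:

```ruby
# First drain every available completion, then resume the fibers one by
# one; resuming mid-iteration can enqueue new SQEs while the CQ is
# still being walked.
def process_completions(completions)
  ready = []
  ready << completions.shift until completions.empty?
  ready.each { |fiber, result| fiber.resume(result) }
end

results = []
fibers = 3.times.map { Fiber.new { results << Fiber.yield } }
fibers.each(&:resume) # run each fiber up to its blocking point
process_completions(fibers.each_with_index.to_a)
```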

Hope this helps!

Updated by ioquatix (Samuel Williams) 2 months ago

My understanding is sockets are buffered, so normally you'll see EAGAIN only if you saturate them.

For write this seems totally reasonable, but I was also checking read which you'd expect to block more often.

fcntl is at 17% - this is a serious cost if you need to do this on every read/write

Yes, agreed. I'll re-run the benchmark after recompiling Ruby with blocking sockets to see if there is an impact or not.

Skimming your io_uring backend code I see you're iterating over available CQEs and resume the fiber for each CQE while iterating.

I can certainly try that. I'm not sure how putting the fibers into the ready array would help since it adds an extra layer of indirection, but I can imagine that it frees up the CQ before new entries are entered into the SQ.

Updated by ioquatix (Samuel Williams) 2 months ago

I missed some functions in io.c which could invoke read. That's why read was showing up. Now that's been patched:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 41.67    0.099333           4     21439         1 io_uring_enter
 27.40    0.065317           5     10957           close
 17.63    0.042024           3     10719         2 accept4
 12.07    0.028768           2     11109         9 newfstatat
  0.24    0.000570           1       313           read

All read and write operations are going via the uring. I feel fairly confident of this (on the io-buffer branch).

Measuring performance systematically is a bit of a pain but here are some comparisons.

Comparing io_uring_submit called right away after preparing the SQE vs deferred:

I found in some cases it's advantageous to call it right away, but in this case it didn't seem to help much. But at least you can see the impact on the syscall count, which confirms it's working.

All I/O is non-blocking in this test. DirectScheduler means it implements io_read and io_write (OP_READ and OP_WRITE respectively). Scheduler means it doesn't implement io_read and io_write, which forces the io_wait path. strace -c is captured separately, so the numbers differ, but they are proportionally correct (because it's slower).

Early io_uring_submit (DirectScheduler, 128 connections)

Running 30s test @ http://localhost:9090
  4 threads and 128 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.52ms  578.19us  13.65ms   64.86%
    Req/Sec     9.03k     0.95k   18.68k    72.95%
  1076598 requests in 30.05s, 47.23MB read
Requests/sec:  35831.03
Transfer/sec:      1.57MB

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 41.33    0.090196           4     21218           io_uring_enter
 25.61    0.055886           5     10753           close
 16.64    0.036306           3     10516         3 accept4
 11.46    0.025016           2     10905         9 newfstatat
  2.80    0.006115           2      2079           mprotect

Deferred io_uring_submit (DirectScheduler, 128 connections)

Running 30s test @ http://localhost:9090
  4 threads and 128 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.41ms  677.38us  12.00ms   63.73%
    Req/Sec     9.31k     1.01k   11.65k    56.88%
  1110951 requests in 30.05s, 48.74MB read
Requests/sec:  36975.48
Transfer/sec:      1.62MB

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 36.38    0.074410           6     11745           io_uring_enter
 27.47    0.056179           4     11987           close
 18.97    0.038801           3     11751         4 accept4
 13.02    0.026624           2     12139         9 newfstatat
  2.94    0.006010           2      2079           mprotect

Early io_uring_submit (Scheduler, 128 connections)

Running 30s test @ http://localhost:9090
  4 threads and 128 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.94ms  567.14us  17.13ms   66.30%
    Req/Sec    10.81k     1.02k   13.44k    58.47%
  1289574 requests in 30.04s, 56.57MB read
Requests/sec:  42926.80
Transfer/sec:      1.88MB

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 23.69    0.048398           4      9924           close
 22.35    0.045668           4      9684           write
 16.04    0.032769           3      9687         3 accept4
 12.88    0.026323           2     10076        79 read
 11.58    0.023663           2     10076         9 newfstatat
  9.70    0.019821           2      9767           io_uring_enter
  2.76    0.005635           2      2079           mprotect

Deferred io_uring_submit (Scheduler, 128 connections)

I feel like the io_uring_enter count here is a bit off. Maybe because the select operation always calls io_uring_enter even if there are no SQEs outstanding. Perhaps it should only be called if the SQ is not empty.

Running 30s test @ http://localhost:9090
  4 threads and 128 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.04ms  557.79us  21.41ms   67.67%
    Req/Sec    10.44k     0.96k   13.07k    64.22%
  1245909 requests in 30.07s, 54.66MB read
Requests/sec:  41434.63
Transfer/sec:      1.82MB

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 23.57    0.057679           5      9964           close
 22.06    0.053990           5      9724           write
 15.48    0.037879           3      9727         3 accept4
 12.95    0.031689           3     10116        79 read
 11.50    0.028137           2     10116         9 newfstatat
  9.24    0.022621           2      9725         1 io_uring_enter
  3.43    0.008383           4      2079           mprotect

Updated by ioquatix (Samuel Williams) 2 months ago

I rewrote the uring implementation to track the number of pending operations. I'm kind of surprised that the SQ doesn't do this.

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 26.59    0.049621           4     10575           close
 24.85    0.046369           4     10335           write
 17.34    0.032363           3     10338         3 accept4
 13.77    0.025693           2     10727        79 read
 11.82    0.022051           2     10727         9 newfstatat
  5.31    0.009911           4      2079           mprotect
  0.11    0.000207           1       192           mmap
  0.09    0.000165           1        87           munmap
  0.05    0.000087           1        48           brk
  0.04    0.000083           1        83         1 io_uring_enter

Now, for non-blocking sockets, we only see an io_uring_enter count proportional to the number of read and write errors, which makes more sense to me.

Running 30s test @ http://localhost:9090
  4 threads and 128 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.87ms    1.36ms  90.28ms   98.60%
    Req/Sec    11.16k     1.40k   26.84k    70.73%
  1332172 requests in 30.04s, 58.44MB read
Requests/sec:  44341.92
Transfer/sec:      1.95MB

As you can imagine, with only 89 calls to io_uring_enter, whether or not it's done early or later has little impact on overall performance.

Updated by ioquatix (Samuel Williams) 2 months ago

Here is DirectScheduler:

Running 30s test @ http://localhost:9090
  4 threads and 128 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.55ms  658.12us  14.96ms   64.38%
    Req/Sec     8.92k     0.99k   11.92k    72.81%
  1063854 requests in 30.09s, 46.67MB read
Requests/sec:  35361.57
Transfer/sec:      1.55MB

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 37.77    0.797226           6    120032           io_uring_enter
 29.02    0.612538           5    120277           close
 19.45    0.410488           3    120041         4 accept4
 13.24    0.279448           2    120429         9 newfstatat
  0.31    0.006596           3      2079           mprotect

I guess it's at least 20-30% slower. My gut feeling is that the greedy read, which avoids a fiber context switch, improves throughput.

Based on this result, I'm still unconvinced we should change sockets back to blocking.

However, we do want to enable DirectScheduler to work efficiently with both.

I propose the following changes:

  • DirectScheduler makes sense for general IO including blocking IO. IO#read and IO#write should invoke the fiber scheduler io_read and io_write respectively. This enables things like non-blocking read/write to stdin and stdout without making them O_NONBLOCK. io_uring supports this directly, while epoll and kqueue will need to use a fcntl wrapper.
  • Socket#read and Socket#write should be implemented via a different scheduler hook, maybe socket_read and socket_write, to go along with what will eventually include socket_recvmsg and socket_sendmsg etc. The implementation of socket_read and socket_write could be the same as io_read and io_write, but for performance reasons should use read|write -> EAGAIN -> polling instead.
  • We might need to check the most efficient way to deal with pipes, I suspect they are more similar to sockets internally than files.
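A rough sketch of how that dispatch might look. The socket_read hook name follows the proposal above and is not a released Ruby API; the stub scheduler is purely illustrative:

```ruby
require "socket"

# Hypothetical dispatch: sockets prefer a dedicated socket_read hook
# (read -> EAGAIN -> polling), everything else falls back to io_read
# (e.g. OP_READ on io_uring), which also covers blocking stdio.
def scheduler_read(scheduler, io, buffer, length)
  if io.is_a?(BasicSocket) && scheduler.respond_to?(:socket_read)
    scheduler.socket_read(io, buffer, length)
  else
    scheduler.io_read(io, buffer, length)
  end
end

# A recording stub standing in for a real fiber scheduler:
class StubScheduler
  attr_reader :calls

  def initialize
    @calls = []
  end

  def socket_read(io, buffer, length)
    @calls << :socket_read
  end

  def io_read(io, buffer, length)
    @calls << :io_read
  end
end
```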

ciconia (Sharon Rosner) what do you think?

Updated by ioquatix (Samuel Williams) 2 months ago

I was playing around with a larger number of connections and the deferred submit:

io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
(the same io_write line repeats 38 times in total)
select_process_completions(completed=38)

So it certainly seems capable of handling lots of events. But I also noticed a lot of minimally interleaved iterations:

io_write:Event_Backend_fiber_transfer -> 46
io_read:Event_Backend_fiber_transfer -> 40
select_process_completions(completed=2)
io_write:Event_Backend_fiber_transfer -> 46
io_read:Event_Backend_fiber_transfer -> 40
select_process_completions(completed=2)
io_write:Event_Backend_fiber_transfer -> 46
io_read:Event_Backend_fiber_transfer -> 40
select_process_completions(completed=2)
io_write:Event_Backend_fiber_transfer -> 46
io_read:Event_Backend_fiber_transfer -> 40
select_process_completions(completed=2)

Maybe it causes more issues because #accept and #close are not asynchronous.

Updated by ciconia (Sharon Rosner) 2 months ago

I missed some functions in io.c which could invoke read. That's why read was showing up.

Sorry, this tripped me up and I was looking for a corresponding write line.

I propose the following changes:

  • DirectScheduler makes sense for general IO, including blocking IO. IO#read and IO#write should invoke the fiber scheduler io_read and io_write hooks respectively. This enables things like non-blocking read/write to stdin and stdout without making them O_NONBLOCK. io_uring supports this directly, while epoll and kqueue will need to use an fcntl wrapper.
  • Socket#read and Socket#write should be implemented via a different scheduler hook, maybe socket_read and socket_write, to go along with what will eventually include socket_recvmsg and socket_sendmsg etc. The implementation of socket_read and socket_write could be the same as io_read and io_write, but for performance reasons should use read|write -> EAGAIN -> polling instead.
  • We might need to check the most efficient way to deal with pipes; I suspect they are internally more similar to sockets than to files.

This seems fine to me, but to play devil's advocate, you are making a design decision based on:

  • Anecdotal benchmarks: the performance difference you see might be reversed in different circumstances.
  • A fiber scheduler implementation that is external to Ruby.

Another point I wanted to bring up: if you are indeed going to implement this kind of behavior in a fiber scheduler, then fiber switching becomes non-deterministic. This has ramifications for the behavior of user programs in two important ways:

  • You will not be able to tell whether a fiber switch happens when calling IO#read et al., which can make debugging more difficult.
  • Cancelling an I/O operation when no fiber switch happens becomes impossible. Cancellation is a whole subject in itself; it is not addressed at all by the current fiber scheduler spec, and IMO it is a crucial aspect of managing concurrency.

I'll just give a very simple example (I don't know how Fiber#raise interacts with the fiber scheduler mechanism, if at all, but let's suppose the fiber scheduler knows how to deal with that):

f1 = Fiber.schedule do
  @io.write('foo') # is a ctx switch happening here?
  puts 'oh hi' # or here?
rescue
  @some_other_io.puts 'oh bye' # or here?
end

f2 = Fiber.schedule do
  f1.raise
end

With your proposition, the output of the above program will change depending on the fiber scheduler implementation and on whether @io is a socket, a file, or something else.

I think it would be better to always do a fiber switch on any I/O. That's what I do, for example, in the Polyphony libev backend: if the read was immediately successful, the fiber snoozes (it schedules itself and yields control to some other fiber). Deterministic behavior is IMO one of the main advantages of using fibers over threads.
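A minimal, scheduler-free sketch of that snooze pattern (the ready queue here is illustrative, not Polyphony's actual API): even when an operation would complete immediately, the fiber requeues itself and yields, so the interleaving is always the same regardless of I/O readiness.

```ruby
# Each fiber "snoozes" after every unit of work: it pushes itself back onto
# the ready queue and yields, guaranteeing a context switch on every step.
ready = []
order = []

worker = proc do |name|
  Fiber.new do
    3.times do
      order << name          # the "I/O" step, completing immediately
      ready << Fiber.current # snooze: reschedule self...
      Fiber.yield            # ...and hand control back to the loop
    end
  end
end

fibers = [worker.call(:a), worker.call(:b)]
ready.concat(fibers)
ready.shift.resume until ready.empty?

order # => [:a, :b, :a, :b, :a, :b]
```

The switch points are fixed by the snooze calls, not by whether data happened to be available, which is what makes the program's output reproducible.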

As I wrote in one of the relevant GitHub issues, I think a design document that describes the behavior of fiber schedulers in detail, and also addresses some of the "harder" aspects of concurrency - error handling, cancellation, determinism, composability - would be beneficial, at the very least as a guiding star for fiber scheduler implementations.
