Bug #17664
Closed
Behavior of sockets changed in Ruby 3.0 to non-blocking
Description
I'm not sure this is a bug, but apparently a change was introduced in Ruby 3.0 that makes sockets non-blocking by default. This change was apparently introduced as part of the work on the FiberScheduler interface. This change of behaviour is not discussed in the Ruby 3.0.0 release notes.
This change complicates the implementation of an io_uring-based fiber scheduler, since io_uring SQE's on fd's with O_NONBLOCK can return EAGAIN just like normal syscalls. Using io_uring with non-blocking fd's defeats the whole purpose of using io_uring in the first place.
A workaround I have put in place in the Polyphony io_uring backend is to make sure O_NONBLOCK is not set before attempting I/O operations on any fd.
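The workaround described above can be sketched in Ruby roughly as follows. This is only an illustration, not the Polyphony code (which lives in a C backend); the helper name `ensure_blocking` is hypothetical, and the pipe is forced into non-blocking mode first so the example does not depend on any particular Ruby version's default.

```ruby
require 'fcntl'

# Clear O_NONBLOCK on the descriptor behind +io+ if it is set.
# Returns true when the flag had to be cleared, false when it was already clear.
# (Hypothetical helper name, for illustration only.)
def ensure_blocking(io)
  flags = io.fcntl(Fcntl::F_GETFL)
  return false if (flags & Fcntl::O_NONBLOCK).zero?

  io.fcntl(Fcntl::F_SETFL, flags & ~Fcntl::O_NONBLOCK)
  true
end

reader, _writer = IO.pipe
# Force the flag on so the example does not rely on the interpreter's default.
reader.fcntl(Fcntl::F_SETFL, reader.fcntl(Fcntl::F_GETFL) | Fcntl::O_NONBLOCK)

changed = ensure_blocking(reader)
still_set = (reader.fcntl(Fcntl::F_GETFL) & Fcntl::O_NONBLOCK) != 0
puts "cleared: #{changed}, O_NONBLOCK still set: #{still_set}"
```

Calling the helper a second time returns false without issuing a second F_SETFL, but it still costs one F_GETFL per call.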
Updated by xtkoba (Tee KOBAYASHI) almost 4 years ago
Updated by jeremyevans0 (Jeremy Evans) almost 4 years ago
- Status changed from Open to Assigned
- Assignee set to ioquatix (Samuel Williams)
I believe this is expected, even if not mentioned in the release notes. I think the Ruby-level API remains the same, but passing the underlying file descriptors to C functions can see changed behavior. Assigning to @ioquatix (Samuel Williams) to confirm this is expected.
Updated by ioquatix (Samuel Williams) over 3 years ago
This change was originally proposed and implemented by @normalperson (Eric Wong).
The outward interface does not change, but you are right it can impact io_uring implementation.
I know this is a problem and ran into the same issue.
I don't know how we should solve the io_uring issue correctly. There are several options:
- Change Ruby default IO back to blocking. But it can cause issues for fiber scheduler since non-blocking hooks will never be invoked.
- In the uring backend, for read/write operations, set the IO to blocking and then revert it afterwards.
- Propose that uring takes a flag/option to do this per SQE, i.e. even if the I/O is set to non-blocking mode, e.g. OP_READV should always behave like blocking operation.
Personally, I like the last option best, since I think it's more predictable. We cannot determine the state of the FD just because of Ruby's default. i.e. if a user makes an IO blocking or non-blocking, we prefer if uring read/write behaves predictably. There is a precedent for this with sendmsg/recvmsg too.
I welcome discussion on this point, but for certain, I believe Ruby being non-blocking by default makes sense and that approach was proposed by Eric, and I agreed with it and finally enabled it in Ruby 3. Since there is no outwardly visible change to behaviour, I didn't think it's a big problem, but I also noticed that Eric forced Unicorn IO back to blocking by default, so it might be nice to have their input on the matter.
Updated by normalperson (Eric Wong) over 3 years ago
samuel@oriontransfer.net wrote:
I welcome discussion on this point, but for certain, I believe
Ruby being non-blocking by default makes sense and that
approach was proposed by Eric, and I agreed with it and
finally enabled it in Ruby 3. Since there is no outwardly
visible change to behaviour, I didn't think it's a big
problem, but I also noticed that Eric forced Unicorn IO back
to blocking by default, so it might be nice to have his input
too.
Yes, I proposed non-blocking originally; but gave up the
proposal because of potential incompatibilities (e.g. this one).
I've mostly given up on Ruby (and coding in general),
so the change to non-blocking happened anyways...
Anyways, unicorn doesn't benefit at all from non-blocking socket
I/O since it only handles one fast client at-a-time. In
unicorn, blocking I/O results in fewer syscalls since there's no
intermediate calls to poll/ppoll/select/etc. Non-blocking I/O
only makes sense for slow clients (and unicorn could never and
will never be capable of handling slow clients).
Anyways, I haven't familiarized myself with io_uring, yet; but
maybe I'll get around it (just not for Ruby :P) if I still have
electricity in a few months time...
Updated by ioquatix (Samuel Williams) over 3 years ago
Thanks @normalperson (Eric Wong) - I understand you aren't interested much in Ruby but wish you the best. Thanks for chiming in promptly.
I probably wouldn't characterise this as an incompatibility, because this problem can surface if the user explicitly opts into non-blocking IO. So, that's a valid state for the file descriptor and we need to handle it.
I believe we should involve the authors of io_uring in this discussion because it seems to me we need a more general solution. In general, file descriptors non-blocking state is something that we cannot anticipate. https://github.com/axboe/liburing/issues/364
Updated by ciconia (Sharon Rosner) over 3 years ago
In the uring backend, for read/write operations, set the IO to blocking and then revert it afterwards.
Why would you need to revert it? In practically all cases I can think of, you're going to do all I/O for a given fd on the same scheduler. In addition, if you need to make two additional fcntl system calls on every I/O operation, it defeats the whole purpose of using io_uring in the first place.
I had another solution in mind, similar to what I mentioned above, but more general:
- Cache the blocking/non-blocking state in an instance variable on the IO/Socket instance.
- If the instance variable is not set, call fcntl and set the instance variable.
- This could be done in order to implement both blocking and non-blocking behavior, according to the type of fiber scheduler. So a libev-based scheduler would be able to set it to non-blocking, and an io_uring-based one would set it to blocking.
Pseudo-code:
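require 'fcntl'

def check_blocking_state(io, block)
  # Cached state; nil until the first call, which forces one fcntl round trip.
  state = io.instance_variable_get(:@blocking_state)
  if block != state
    flags = io.fcntl(Fcntl::F_GETFL)
    # Clear O_NONBLOCK for blocking mode, set it for non-blocking mode.
    flags = block ? (flags & ~Fcntl::O_NONBLOCK) : (flags | Fcntl::O_NONBLOCK)
    io.fcntl(Fcntl::F_SETFL, flags)
    io.instance_variable_set(:@blocking_state, block)
  end
end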
This solution, called before any I/O operation, keeps the extra system calls to a minimum and lets you implement schedulers for both blocking and non-blocking I/O. It's also fully backwards compatible with the core Ruby IO and Socket implementations.
(BTW the current IO implementation, when no fiber scheduler is used, calls fcntl at least once for basically every I/O operation, which is also a waste in practically all cases.)
Change Ruby default IO back to blocking. But it can cause issues for fiber scheduler since non-blocking hooks will never be invoked.
I think sockets should be changed back to blocking by default, even if just for the sake of consistency. This change took me by surprise, and it cost me a few hours of looking around trying to figure out why I was getting EAGAIN on sockets and not on files.
Updated by Eregon (Benoit Daloze) over 3 years ago
One potential issue with caching the non-blocking state in Ruby is that a C extension might call fcntl() directly on the fd of an IO to change its non-blocking state.
Not sure if that's done in practice, but it could be an issue.
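To make the concern concrete, here is a small sketch of how a cached state can go stale. `BlockingStateCache` is a hypothetical class, and the direct `fcntl` call in the middle merely stands in for what a C extension might do on the raw file descriptor.

```ruby
require 'fcntl'

# Hypothetical cache: remembers the last blocking state it applied to each IO.
class BlockingStateCache
  def initialize
    @state = {}.compare_by_identity
  end

  # Returns true if fcntl was needed, false if the cache said "already done".
  def apply(io, blocking)
    return false if @state[io] == blocking

    flags = io.fcntl(Fcntl::F_GETFL)
    flags = blocking ? (flags & ~Fcntl::O_NONBLOCK) : (flags | Fcntl::O_NONBLOCK)
    io.fcntl(Fcntl::F_SETFL, flags)
    @state[io] = blocking
    true
  end
end

def nonblocking?(io)
  (io.fcntl(Fcntl::F_GETFL) & Fcntl::O_NONBLOCK) != 0
end

reader, _writer = IO.pipe
cache = BlockingStateCache.new
cache.apply(reader, true) # cache now believes the fd is blocking

# Stand-in for a C extension calling fcntl(2) on the fd behind Ruby's back.
reader.fcntl(Fcntl::F_SETFL, reader.fcntl(Fcntl::F_GETFL) | Fcntl::O_NONBLOCK)

skipped = !cache.apply(reader, true) # cache hit, so no fcntl is issued
actually_nonblocking = nonblocking?(reader)
puts "cache skipped fcntl: #{skipped}, fd actually non-blocking: #{actually_nonblocking}"
```

In this sketch the cache reports the descriptor as blocking while the kernel flag says otherwise, which is exactly the mismatch that would let EAGAIN resurface.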
Updated by ioquatix (Samuel Williams) over 3 years ago
I have researched this topic today and I'm going to share some of my notes and thoughts.
Firstly, with regards to performance, the most important platform is Linux, and I personally believe that io_uring is going to be the most important interface. We can also support epoll as a fallback, but it's less complete. As for other interfaces, kqueue is similar to epoll and is less interesting.
An important point to consider, is that on Linux, I've been told that sockets don't support asynchronous read and/or write. Internally they are emulated by the same user-space implementation - try reading, and on EAGAIN fall back to polling.
In my testing, the io_uring io_read and io_write operations perform about 20% worse in practice in my benchmarks. This was surprising to me. My current understanding as to why it's slow is because when we perform io_read, internally it performs read, but because we have to defer that operation until the next iteration of the run loop, we pay quite a bit of latency cost here.
The fast path is this:
result = read(fd, ...)
if (EAGAIN) {
wait_readable
Fiber.yield
}
// In scheduler:
select -> fiber.resume(result)
The slow path is this:
io_read(fd, ...) -> OP_READ SQE
io_uring_submit() // optional but improve throughput by 5% if done here
Fiber.yield // Unconditional.
// In scheduler:
wait_cqe -> fiber.resume result
Now, there are actually two interpretations of the above, essentially it depends on the percentage of operations you expect read to result in EAGAIN. If you expect that percentage to be low, the single system call for read is far more efficient for Ruby, since we avoid the context switch. In both cases you need to call a system call, either read or io_uring_submit. With io_uring_submit, you can amortise the cost of the system calls, but it turns out that it's much less than the cost of the context switch in Ruby from what I can tell. The interesting point is, when the IO is not so busy, and we expect a higher chance of EAGAIN, the overhead of the yield is far less important.
In my benchmarks using 64+ connections, I only observed OP_READ -> EAGAIN < 20 times out of 1 million requests. So, it definitely wasn't a common code path on a server with a lot of active IO. In the case that you DO get EAGAIN, realistically it seems like you will have to wait for a while anyway, so the extra cost of punting the operation through to io_wait seems negligible in practice.
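The fast path described above (try the read first, only wait on EAGAIN) can be sketched in plain Ruby as follows. This is a simplified illustration rather than the actual io.c logic; `greedy_read` is a hypothetical name, and without a fiber scheduler installed `wait_readable` simply blocks the calling thread.

```ruby
require 'io/wait'

# Greedy read: attempt the read first, and only wait when the kernel reports
# that no data is available (EAGAIN surfaces here as :wait_readable).
# Returns [data, waited]; data is nil at end of file.
def greedy_read(io, maxlen)
  waited = false
  loop do
    result = io.read_nonblock(maxlen, exception: false)
    return [result, waited] unless result == :wait_readable

    waited = true
    io.wait_readable # under a fiber scheduler, this is where the fiber would yield
  end
end

reader, writer = IO.pipe
writer.write('hello')
data, waited = greedy_read(reader, 1024)
puts "read #{data.inspect}, had to wait: #{waited}"
```

When data is already buffered, as here, the read completes with a single system call and no wait, which is the case the benchmark numbers above suggest is by far the most common.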
So the net result of this is, from what I can measure so far, non-blocking sockets are the most efficient way to handle IO. Forcing sockets to go through OP_READ seems to yield worse performance in every configuration I could think of. I'm going to continue investigating this as I'm a little bit unconvinced by the results but based on what I'm seeing, non-blocking sockets (O_NONBLOCK) seem significantly more efficient. If you can produce benchmarks which show something other than what I've found so far, I'd be most interested and I think it would help make the case that O_NONBLOCK by default was the wrong choice.
As an aside, I did try to make stdin, stdout and stderr nonblocking.
It turns out it's pretty difficult as a ton of things start breaking in unexpected ways - e.g. printf.
Fortunately, I think there is a good solution - we do have the ability to check if an IO is in blocking mode, and if that's the case, we can punt it off to OP_READ which while a little bit slower will do the right thing without needing O_NONBLOCK. This allows us to have non-blocking stdin, stdout which would be really great.
I'm still working out the details of how this should fit together within io.c but largely I'm convinced that:
- Non-blocking Socket is the fast path.
- Blocking file descriptors can still be asynchronous. In io_uring we can use OP_READ and in epoll/kqueue we can use fcntl to toggle O_NONBLOCK. I don't care much about performance impact in epoll and kqueue cases since it's not what I'm considering a hot path.
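For the epoll/kqueue case, toggling O_NONBLOCK around an operation might look something like the sketch below. The helper `with_nonblock` is hypothetical and only shows the flag handling; it does not perform any polling itself.

```ruby
require 'fcntl'

# Run the block with O_NONBLOCK set on +io+, then restore the original flags.
# Costs up to three fcntl calls per operation (get, set, restore).
# (Hypothetical helper, for illustration only.)
def with_nonblock(io)
  original = io.fcntl(Fcntl::F_GETFL)
  already = (original & Fcntl::O_NONBLOCK) != 0
  io.fcntl(Fcntl::F_SETFL, original | Fcntl::O_NONBLOCK) unless already
  yield
ensure
  # Only restore when we actually changed the flags.
  io.fcntl(Fcntl::F_SETFL, original) if original && !already
end

reader, _writer = IO.pipe
# Start from a known blocking state so the example is version-independent.
reader.fcntl(Fcntl::F_SETFL, reader.fcntl(Fcntl::F_GETFL) & ~Fcntl::O_NONBLOCK)

inside = with_nonblock(reader) { (reader.fcntl(Fcntl::F_GETFL) & Fcntl::O_NONBLOCK) != 0 }
after = (reader.fcntl(Fcntl::F_GETFL) & Fcntl::O_NONBLOCK) != 0
puts "non-blocking inside: #{inside}, after: #{after}"
```

The extra fcntl calls are the cost being accepted here, on the grounds stated above that these backends are not the hot path.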
This change complicates the implementation of an io_uring-based fiber scheduler, since io_uring SQE's on fd's with O_NONBLOCK can return EAGAIN just like normal syscalls. Using io_uring with non-blocking fd's defeats the whole purpose of using io_uring in the first place.
I believe the correct implementation here is like this: https://github.com/socketry/event/blob/8c4449ebe0a3c76681655cf175d5aa6589934a9c/ext/event/backend/uring.c#L291-L306
If you get EAGAIN, you should punt the request off to your io_wait logic.
We are also discussing whether we can get io_uring to implement this automatically: https://github.com/axboe/liburing/issues/364 but I'm now not sure if this is really the right approach.
One final point to consider is buffer management. If we have thousands of sockets, OP_READ-based implementations need one buffer per operation at least. But io_wait style implementation can avoid the need for a buffer until the operation can proceed. io_uring has some solutions for buffer assignment, but it may not be that easy to take advantage of unless we adopt something like the IO::Buffer proposal and internally allocate a pool of them for non-blocking IO.
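As a rough illustration of the io_wait-style buffer usage, the sketch below borrows a buffer from a small pool only once the descriptor is readable. `BufferPool` is a hypothetical class built on plain Strings; it is not the IO::Buffer proposal and says nothing about how io_uring's own buffer assignment works.

```ruby
require 'io/wait'

# Hypothetical fixed-size pool of reusable read buffers (plain Strings).
class BufferPool
  def initialize(count, size)
    @free = Array.new(count) { String.new(capacity: size) }
  end

  def available
    @free.size
  end

  # Borrow a buffer for the duration of the block and always give it back.
  def with_buffer
    buffer = @free.pop or raise 'pool exhausted'
    yield buffer
  ensure
    if buffer
      buffer.clear
      @free.push(buffer)
    end
  end
end

pool = BufferPool.new(2, 4096)
reader, writer = IO.pipe
writer.write('ping')

# No buffer is held while waiting; one is taken only once data is ready.
reader.wait_readable
data = pool.with_buffer { |buf| reader.read_nonblock(4096, buf).dup }
puts "read #{data.inspect}, buffers free again: #{pool.available}"
```

With this shape, the number of buffers needed tracks the number of reads in progress rather than the number of open sockets.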
Updated by ioquatix (Samuel Williams) over 3 years ago
Here are some summaries from strace -c:
Using non-blocking sockets (note the errors column which indicates EAGAIN):
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ------------------
28.10 0.621514 5 119770 close
25.50 0.564066 4 119530 write
18.39 0.406707 3 119532 2 accept4
14.50 0.320830 2 119844 1 read
13.39 0.296214 2 119922 9 newfstatat
0.04 0.000961 1 494 260 openat
0.02 0.000361 1 186 mmap
0.02 0.000333 0 551 551 readlink
0.01 0.000227 2 87 munmap
0.01 0.000138 2 54 brk
0.01 0.000116 4 26 getdents64
0.00 0.000097 1 80 fcntl
0.00 0.000062 6 9 1 io_uring_enter
0.00 0.000059 0 63 mprotect
0.00 0.000036 0 131 128 ioctl
0.00 0.000030 10 3 getsockname
0.00 0.000029 0 53 geteuid
0.00 0.000028 5 5 2 connect
0.00 0.000027 27 1 io_uring_setup
0.00 0.000026 0 53 getegid
0.00 0.000025 0 52 getuid
0.00 0.000025 0 52 getgid
0.00 0.000020 2 7 socket
0.00 0.000014 2 7 4 prctl
0.00 0.000008 1 5 futex
0.00 0.000007 1 7 rt_sigprocmask
0.00 0.000004 1 3 getpid
0.00 0.000004 2 2 bind
0.00 0.000003 1 2 sendto
0.00 0.000003 1 3 recvmsg
0.00 0.000003 3 1 listen
0.00 0.000002 0 73 lseek
0.00 0.000002 1 2 1 recvfrom
0.00 0.000002 2 1 setsockopt
0.00 0.000001 1 1 ppoll
0.00 0.000000 0 17 rt_sigaction
0.00 0.000000 0 6 pread64
0.00 0.000000 0 1 1 access
0.00 0.000000 0 1 execve
0.00 0.000000 0 1 getcwd
0.00 0.000000 0 1 sysinfo
0.00 0.000000 0 1 sigaltstack
0.00 0.000000 0 2 1 arch_prctl
0.00 0.000000 0 1 gettid
0.00 0.000000 0 1 sched_getaffinity
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 1 timer_create
0.00 0.000000 0 1 clock_gettime
0.00 0.000000 0 1 set_robust_list
0.00 0.000000 0 2 eventfd2
0.00 0.000000 0 3 prlimit64
0.00 0.000000 0 1 getrandom
------ ----------- ----------- --------- --------- ------------------
100.00 2.211984 3 600654 961 total
Using blocking sockets:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ------------------
23.74 0.504460 5 92066 1 io_uring_enter
22.62 0.480659 5 92300 close
16.98 0.360809 1 184200 fcntl
14.83 0.315110 3 92062 2 accept4
11.16 0.237053 2 92373 read
10.55 0.224161 2 92452 9 newfstatat
0.03 0.000714 1 494 260 openat
0.02 0.000484 0 551 551 readlink
0.01 0.000274 4 62 brk
0.01 0.000262 3 87 munmap
0.01 0.000255 1 186 mmap
0.01 0.000210 1 130 127 ioctl
0.01 0.000123 1 63 mprotect
0.00 0.000082 1 73 lseek
0.00 0.000036 2 17 rt_sigaction
0.00 0.000030 0 53 geteuid
0.00 0.000027 0 52 getuid
0.00 0.000026 0 52 getgid
0.00 0.000026 0 53 getegid
0.00 0.000019 0 26 getdents64
0.00 0.000009 4 2 eventfd2
0.00 0.000005 0 7 rt_sigprocmask
0.00 0.000005 5 1 getrandom
0.00 0.000004 4 1 sysinfo
0.00 0.000004 4 1 timer_create
0.00 0.000003 1 3 getpid
0.00 0.000003 3 1 sigaltstack
0.00 0.000003 0 5 futex
0.00 0.000000 0 6 pread64
0.00 0.000000 0 1 1 access
0.00 0.000000 0 7 socket
0.00 0.000000 0 5 2 connect
0.00 0.000000 0 2 sendto
0.00 0.000000 0 2 1 recvfrom
0.00 0.000000 0 3 recvmsg
0.00 0.000000 0 2 bind
0.00 0.000000 0 1 listen
0.00 0.000000 0 3 getsockname
0.00 0.000000 0 1 setsockopt
0.00 0.000000 0 1 execve
0.00 0.000000 0 1 getcwd
0.00 0.000000 0 7 4 prctl
0.00 0.000000 0 2 1 arch_prctl
0.00 0.000000 0 1 gettid
0.00 0.000000 0 1 sched_getaffinity
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 1 clock_gettime
0.00 0.000000 0 1 ppoll
0.00 0.000000 0 1 set_robust_list
0.00 0.000000 0 3 prlimit64
0.00 0.000000 0 1 io_uring_setup
------ ----------- ----------- --------- --------- ------------------
100.00 2.124856 3 647427 959 total
Based on this, maybe my implementation of read is not working correctly. I'll have to check it, but generally, you can see the big difference.
Updated by ciconia (Sharon Rosner) over 3 years ago
In my testing, the io_uring io_read and io_write operations perform about 20% worse in practice in my benchmarks. This was surprising to me. My current understanding as to why it's slow is because when we perform io_read, internally it performs read, but because we have to defer that operation until the next iteration of the run loop, we pay quite a bit of latency cost here.
I think the increased latency is to be expected. Did you measure throughput? In my own benchmarks (on Polyphony) I've seen better throughput, slightly worse latency (don't remember the numbers though).
In my benchmarks using 64+ connections, I only observed OP_READ -> EAGAIN < 20 times out of 1 million requests.
My understanding is sockets are buffered, so normally you'll see EAGAIN only if you saturate them.
Using blocking sockets:
Looking at those numbers a few things stand out (I'm assuming your benchmark is with some kind of HTTP server):
- There's no line for write (is the blocking version working correctly?)
- fcntl is at 17% - this is a serious cost if you need to do this on every read/write
- io_uring_enter is called once for each accept, so apparently there's no batching of SQEs.
Skimming your io_uring backend code I see you're iterating over available CQEs and resume the fiber for each CQE while iterating. In my experience you'll get better numbers if you put those resumable fibers in an array instead, then resume them one by one after having exhausted all available CQEs.
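The two-phase approach suggested here can be sketched with plain fibers. The completion queue below is simulated with an Array and a hypothetical `Completion` struct; a real backend would read entries from the io_uring CQ ring instead.

```ruby
# Simulated completion entry: which fiber to wake and with what result.
Completion = Struct.new(:fiber, :result)

log = []
fibers = 3.times.map do |i|
  Fiber.new do
    result = Fiber.yield # suspended, waiting for an "I/O completion"
    log << "fiber #{i} resumed with #{result}"
  end
end
fibers.each(&:resume) # run each fiber up to its first yield

# Stand-in for the CQ ring (assumption: one completion per waiting fiber).
completion_queue = fibers.each_with_index.map { |f, i| Completion.new(f, i * 10) }

# Phase 1: drain every available completion without running user code.
ready = []
ready << completion_queue.shift until completion_queue.empty?
log << "drained #{ready.size} completions"

# Phase 2: resume the collected fibers one by one.
ready.each { |c| c.fiber.resume(c.result) }
puts log
```

Because no fiber runs during phase 1, the queue is emptied before any resumed fiber gets a chance to submit new work.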
Hope this helps!
Updated by ioquatix (Samuel Williams) over 3 years ago
My understanding is sockets are buffered, so normally you'll see EAGAIN only if you saturate them.
For write this seems totally reasonable, but I was also checking read which you'd expect to block more often.
fcntl is at 17% - this is a serious cost if you need to do this on every read/write
Yes, agreed. I'll re-run the benchmark after recompiling Ruby with blocking sockets to see if there is an impact or not.
Skimming your io_uring backend code I see you're iterating over available CQEs and resume the fiber for each CQE while iterating.
I can certainly try that. I'm not sure how putting the fibers into the ready array would help since it adds an extra layer of indirection, but I can imagine that it frees up the CQ before new entries are entered into the SQ.
Updated by ioquatix (Samuel Williams) over 3 years ago
I missed some functions in io.c which could invoke read. That's why read was showing up. Now that's been patched:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ------------------
41.67 0.099333 4 21439 1 io_uring_enter
27.40 0.065317 5 10957 close
17.63 0.042024 3 10719 2 accept4
12.07 0.028768 2 11109 9 newfstatat
0.24 0.000570 1 313 read
All read and write operations are going via the uring. I feel fairly confident of this (on io-buffer branch).
Measuring performance systematically is a bit of a pain but here are some comparisons.
Comparing io_uring_submit (right away after prep sqe vs deferred)
I found in some cases it's advantageous to call it right away, but in this case, it didn't seem to help much. But at least you can see impact to sys call count to confirm it's working.
All I/O is non-blocking in this test. DirectScheduler means it implements io_read and io_write (OP_READ and OP_WRITE respectively). Scheduler means it doesn't implement io_read and io_write which forces io_wait path. strace -c is captured separately and the numbers are different but proportionally correct (because it's slower).
Early io_uring_submit (DirectScheduler, 128 connections)
Running 30s test @ http://localhost:9090
4 threads and 128 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 3.52ms 578.19us 13.65ms 64.86%
Req/Sec 9.03k 0.95k 18.68k 72.95%
1076598 requests in 30.05s, 47.23MB read
Requests/sec: 35831.03
Transfer/sec: 1.57MB
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ------------------
41.33 0.090196 4 21218 io_uring_enter
25.61 0.055886 5 10753 close
16.64 0.036306 3 10516 3 accept4
11.46 0.025016 2 10905 9 newfstatat
2.80 0.006115 2 2079 mprotect
Deferred io_uring_submit (DirectScheduler, 128 connections)
Running 30s test @ http://localhost:9090
4 threads and 128 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 3.41ms 677.38us 12.00ms 63.73%
Req/Sec 9.31k 1.01k 11.65k 56.88%
1110951 requests in 30.05s, 48.74MB read
Requests/sec: 36975.48
Transfer/sec: 1.62MB
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ------------------
36.38 0.074410 6 11745 io_uring_enter
27.47 0.056179 4 11987 close
18.97 0.038801 3 11751 4 accept4
13.02 0.026624 2 12139 9 newfstatat
2.94 0.006010 2 2079 mprotect
Early io_uring_submit (Scheduler, 128 connections)
Running 30s test @ http://localhost:9090
4 threads and 128 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 2.94ms 567.14us 17.13ms 66.30%
Req/Sec 10.81k 1.02k 13.44k 58.47%
1289574 requests in 30.04s, 56.57MB read
Requests/sec: 42926.80
Transfer/sec: 1.88MB
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ------------------
23.69 0.048398 4 9924 close
22.35 0.045668 4 9684 write
16.04 0.032769 3 9687 3 accept4
12.88 0.026323 2 10076 79 read
11.58 0.023663 2 10076 9 newfstatat
9.70 0.019821 2 9767 io_uring_enter
2.76 0.005635 2 2079 mprotect
Deferred io_uring_submit (Scheduler, 128 connections)
I feel like the io_uring_enter count here is a bit off. Maybe because the select operation always calls io_uring_enter even if there are no SQEs outstanding. Perhaps it should only be called if the SQ is not empty.
Running 30s test @ http://localhost:9090
4 threads and 128 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 3.04ms 557.79us 21.41ms 67.67%
Req/Sec 10.44k 0.96k 13.07k 64.22%
1245909 requests in 30.07s, 54.66MB read
Requests/sec: 41434.63
Transfer/sec: 1.82MB
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ------------------
23.57 0.057679 5 9964 close
22.06 0.053990 5 9724 write
15.48 0.037879 3 9727 3 accept4
12.95 0.031689 3 10116 79 read
11.50 0.028137 2 10116 9 newfstatat
9.24 0.022621 2 9725 1 io_uring_enter
3.43 0.008383 4 2079 mprotect
Updated by ioquatix (Samuel Williams) over 3 years ago
I rewrote the uring implementation to track the number of pending operations. I'm kind of surprised that the SQ doesn't do this.
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ------------------
26.59 0.049621 4 10575 close
24.85 0.046369 4 10335 write
17.34 0.032363 3 10338 3 accept4
13.77 0.025693 2 10727 79 read
11.82 0.022051 2 10727 9 newfstatat
5.31 0.009911 4 2079 mprotect
0.11 0.000207 1 192 mmap
0.09 0.000165 1 87 munmap
0.05 0.000087 1 48 brk
0.04 0.000083 1 83 1 io_uring_enter
Now for non-blocking sockets, we only see io_uring_enter count proportional to the number of read and write errors which makes more sense to me.
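The idea of tracking pending operations so the kernel is only entered when there is work can be modelled like this. `SubmissionTracker` is a hypothetical, purely in-memory stand-in; it makes no io_uring calls and is not the code from the event gem.

```ruby
# Hypothetical model of a submission queue wrapper that counts pending entries.
class SubmissionTracker
  attr_reader :pending, :enter_calls

  def initialize
    @pending = 0     # entries prepared but not yet submitted
    @enter_calls = 0 # how many times we would have entered the kernel
  end

  def prepare
    @pending += 1
  end

  # Stand-in for the submit step: skip the syscall when nothing is pending.
  def submit
    return 0 if @pending.zero?

    submitted = @pending
    @pending = 0
    @enter_calls += 1
    submitted
  end
end

tracker = SubmissionTracker.new
5.times { tracker.submit }  # idle loop iterations: nothing to submit
3.times { tracker.prepare }
submitted = tracker.submit  # one call covers all three entries
puts "submitted: #{submitted}, enter calls: #{tracker.enter_calls}"
```

In this model the five idle iterations cost nothing, which mirrors the drop in io_uring_enter calls reported above.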
Running 30s test @ http://localhost:9090
4 threads and 128 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 2.87ms 1.36ms 90.28ms 98.60%
Req/Sec 11.16k 1.40k 26.84k 70.73%
1332172 requests in 30.04s, 58.44MB read
Requests/sec: 44341.92
Transfer/sec: 1.95MB
As you can imagine, with only 89 calls to io_uring_enter, whether or not it's done early or later has little impact on overall performance.
Updated by ioquatix (Samuel Williams) over 3 years ago
Here is DirectScheduler:
Running 30s test @ http://localhost:9090
4 threads and 128 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 3.55ms 658.12us 14.96ms 64.38%
Req/Sec 8.92k 0.99k 11.92k 72.81%
1063854 requests in 30.09s, 46.67MB read
Requests/sec: 35361.57
Transfer/sec: 1.55MB
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ------------------
37.77 0.797226 6 120032 io_uring_enter
29.02 0.612538 5 120277 close
19.45 0.410488 3 120041 4 accept4
13.24 0.279448 2 120429 9 newfstatat
0.31 0.006596 3 2079 mprotect
I guess it's at least 20-30% slower. My gut feeling is, greedy read which avoids fiber context switch improves throughput.
Based on this result, I'm still unconvinced we should change sockets back to blocking.
However, we do want to enable DirectScheduler to work efficiently with both.
I propose the following changes:
- DirectScheduler makes sense for general IO including blocking IO. IO#read and IO#write should invoke the fiber scheduler io_read and io_write respectively. This enables things like non-blocking read/write to stdin and stdout without making them O_NONBLOCK. io_uring supports this directly, while epoll and kqueue will need to use a fcntl wrapper.
- Socket#read and Socket#write should be implemented via a different scheduler hook, maybe socket_read and socket_write, to go along with what will eventually include socket_recvmsg and socket_sendmsg etc. The implementation of socket_read and socket_write could be the same as io_read and io_write, but for performance reasons should use read|write -> EAGAIN -> polling instead.
- We might need to check the most efficient way to deal with pipes, I suspect they are more similar to sockets internally than files.
@ciconia (Sharon Rosner) what do you think?
Updated by ioquatix (Samuel Williams) over 3 years ago
I was playing around with a larger number of connections and the deferred submit:
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
io_write:Event_Backend_fiber_transfer -> 46
select_process_completions(completed=38)
So certainly seems that it's capable of handling lots of events. But what I noticed is that there are a lot of minimally interleaved iterations:
io_write:Event_Backend_fiber_transfer -> 46
io_read:Event_Backend_fiber_transfer -> 40
select_process_completions(completed=2)
io_write:Event_Backend_fiber_transfer -> 46
io_read:Event_Backend_fiber_transfer -> 40
select_process_completions(completed=2)
io_write:Event_Backend_fiber_transfer -> 46
io_read:Event_Backend_fiber_transfer -> 40
select_process_completions(completed=2)
io_write:Event_Backend_fiber_transfer -> 46
io_read:Event_Backend_fiber_transfer -> 40
select_process_completions(completed=2)
Maybe because #accept and #close are not asynchronous, it causes more issues.
Updated by ciconia (Sharon Rosner) over 3 years ago
I missed some functions in io.c which could invoke read. That's why read was showing up.
Sorry, this tripped me up and I was looking for a corresponding write line.
I propose the following changes:
- DirectScheduler makes sense for general IO including blocking IO. IO#read and IO#write should invoke the fiber scheduler io_read and io_write respectively. This enables things like non-blocking read/write to stdin and stdout without making them O_NONBLOCK. io_uring supports this directly, while epoll and kqueue will need to use a fcntl wrapper.
- Socket#read and Socket#write should be implemented via a different scheduler hook, maybe socket_read and socket_write, to go along with what will eventually include socket_recvmsg and socket_sendmsg etc. The implementation of socket_read and socket_write could be the same as io_read and io_write, but for performance reasons should use read|write -> EAGAIN -> polling instead.
- We might need to check the most efficient way to deal with pipes, I suspect they are more similar to sockets internally than files.
This seems fine to me, but to play the devil's advocate you are making a design decision which is based on:
- Anecdotal benchmarks - the performance difference you see might be reversed in different circumstances.
- A fiber scheduler implementation that is external to Ruby.
Another point I wanted to bring up is that if you are indeed going to implement this kind of behavior in a fiber scheduler then the fiber switching becomes non-deterministic. This has ramifications for the behavior of user programs, in two important ways:
- You will not be able to tell whether a fiber switch happens on calling IO#read et al, which might lead to difficulties in debugging.
- Cancelling an I/O operation where there's no fiber switch happening becomes impossible. Cancellation is a whole subject in itself, it is not addressed at all by the current fiber scheduler spec, and IMO is a crucial aspect of managing concurrency.
I'll just give a very simple example (I don't know how Fiber#raise interacts with the fiber scheduler mechanism, if at all, but let's suppose the fiber scheduler knows how to deal with that):
f1 = Fiber.schedule do
  @io.write('foo') # is a ctx switch happening here?
  puts 'oh hi' # or here?
rescue
  @some_other_io.puts 'oh bye' # or here?
end
f2 = Fiber.schedule do
  f1.raise
end
With your proposition, the output of the above program will change according to the fiber scheduler implementation and whether @io is a socket or file or something else.
I think it would be better to always do a fiber switch on any I/O. That's what I do for example in the Polyphony libev backend: if the read was immediately successful, the fiber snoozes (schedules itself and yields control to some other fiber). Deterministic behavior is IMO one of the main advantages of using fibers compared to threads.
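A toy model of the "always switch" behaviour, using plain fibers and a run queue, might look like this. `TinyLoop` and its `snooze` method are hypothetical names for illustration; this is not Polyphony's implementation and it does not use the Ruby fiber scheduler interface.

```ruby
require 'fiber' # needed for Fiber.current on older Rubies; harmless on newer ones

# Minimal run queue: a fiber "snoozes" by re-queuing itself and yielding.
class TinyLoop
  def initialize
    @queue = []
  end

  def spawn(&block)
    @queue << Fiber.new(&block)
  end

  # Schedule the current fiber to run again later, then give up control.
  def snooze
    @queue << Fiber.current
    Fiber.yield
  end

  def run
    while (fiber = @queue.shift)
      fiber.resume
    end
  end
end

run_loop = TinyLoop.new
order = []
reader, writer = IO.pipe
writer.write('x')

run_loop.spawn do
  reader.read_nonblock(1) # data is already buffered, so this succeeds at once
  run_loop.snooze         # switch anyway, so ordering does not depend on readiness
  order << :after_read
end
run_loop.spawn { order << :other_fiber }
run_loop.run
p order
```

Here the second fiber always runs before the first one continues past its read, regardless of whether the read had to wait.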
As I wrote on one of the relevant GitHub issues, I think a design document that describes the behavior of fiber schedulers in detail, and also addresses some of the "harder" aspects of concurrency - error handling, cancellation, determinism, composability - might be beneficial, at the very least as a guiding star for fiber scheduler implementations.
Updated by eviljoel (evil joel) almost 2 years ago
@ioquatix (Samuel Williams), I maintain a proprietary, custom OpenSSL C extension as part of my company's security scanner. This change broke the unit tests for our C extension. (This change probably also delayed our upgrade to Ruby 3 from Ruby 2.7 due to the complexity of debugging this issue.) OpenSSL's SSL_accept function inherits the blocking behavior of the socket file descriptor. Since our C extension unit tests relied on SSL_accept being blocking, upgrading to Ruby 3+ broke our unit tests. (Also, it took me a while to figure out it was just the unit tests that broke and not the code being tested.)
This "bug" has been around for some time now and is established behavior, so I don't expect this aspect of the Fiber implementation to be reverted, but I did want to raise the fact that this undocumented change did break somebody's code somewhere.
Also, I wanted to ask whether it is an acceptable workaround to set the socket file descriptor back to blocking (via C)? Would doing this have unforeseen consequences if I'm not using Fibers? (I know this isn't the best place to ask support questions but others are likely to come here with the same question.)
Thank you.
Updated by ioquatix (Samuel Williams) almost 2 years ago
@eviljoel (evil joel) thanks for your report.
My initial thoughts are:
- If you are using Ruby to construct sockets, you should probably use the Ruby methods to accept connections, etc. They will correctly deal with this. At worst, you might be blocking without releasing the GVL. Better to use rb_io_wait.
- I don't think we will roll back this change but it's not impossible - in theory it should not be externally visible. The reason for non-blocking IO is nuanced and complex, and we might eventually be able to achieve what we want without making all file descriptors non-blocking.
- You can set a socket back to blocking if you don't intend for Ruby code to use it directly, but make sure you are handling releasing the GVL.
Updated by eviljoel (evil joel) almost 2 years ago
@ioquatix (Samuel Williams), thank you for your reply.
The sockets are created in Ruby. I'm interacting with OpenSSL via FFI, so that part is also largely in Ruby.
Since I commented yesterday, I fixed the code to work with Ruby 3. I changed all relevant calls to OpenSSL to behave in a non-blocking manner leveraging IO.select. It really wasn't as difficult as I expected.
Thanks again.
Updated by ioquatix (Samuel Williams) almost 2 years ago
Regarding IO.select, my advice is to prefer wait_readable or wait_writable if possible.
Updated by eviljoel (evil joel) almost 2 years ago
I can't use IO#wait_readable and IO#wait_writable because OpenSSL can renegotiate with two-way communication at any time. Blocking while waiting exclusively for reading or exclusively for writing would cause the program (or any OpenSSL plugin) to hang. I could probably use the 'events' form of IO#wait however. Thanks again.
Updated by ioquatix (Samuel Williams) almost 2 years ago
I could probably use the 'events' form of IO#wait however.
Yes, you should do that. IO.select for a single file descriptor is hard to implement efficiently into the event loop.
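For readers landing here with the same question, a minimal sketch of the 'events' form of IO#wait follows. The helper name `wait_any` is hypothetical, the OpenSSL/FFI side is left out entirely, and the exact truthy return value differs between Ruby versions, so the example only checks truthiness.

```ruby
require 'io/wait'
require 'socket'

# Wait until +io+ is readable or writable, whichever comes first.
# Returns a truthy value when ready and a falsy value on timeout.
# (Hypothetical helper name, for illustration only.)
def wait_any(io, timeout)
  io.wait(IO::READABLE | IO::WRITABLE, timeout)
end

left, _right = UNIXSocket.pair
# A freshly created socket has send-buffer space, so it is writable right away.
ready = wait_any(left, 1)
puts "ready: #{ready ? true : false}"
```

A caller driving a TLS handshake would typically loop, retrying the operation each time the socket becomes ready in either direction.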
Updated by jeremyevans0 (Jeremy Evans) over 1 year ago
- Status changed from Assigned to Closed