Feature #19642
closedRemove vectored read/write from `io.c`.
Description
https://github.com/socketry/async/issues/228#issuecomment-1546789910 is a comment from the kernel developer who tells us that writev
is always worse than write
system call.
A large amount of complexity in io.c
comes from optional support from writev
.
So, I'd like to remove support for writev
.
I may try to measure the performance before/after. However it may not show much difference, except that the implementation in io.c
can be much simpler.
Updated by ioquatix (Samuel Williams) over 1 year ago
- Tracker changed from Bug to Feature
- Backport deleted (
3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN)
Updated by ioquatix (Samuel Williams) over 1 year ago
- Assignee set to ioquatix (Samuel Williams)
Updated by ioquatix (Samuel Williams) over 1 year ago
Updated by ioquatix (Samuel Williams) over 1 year ago
I would like to do a more comprehensive review of performance, but it seems minimal, even in the best possible circumstance:
| |compare-ruby|built-ruby|
|:---------|-----------:|---------:|
|io_write | 2.098| 2.089|
| | 1.00x| -|
Updated by tenderlovemaking (Aaron Patterson) over 1 year ago
I understand the concern of copying the iovec
, but it seems like the overhead of making N calls to write
would at some point be more expensive than copying the struct once. I guess under the hood writev
may just be calling write
, in which case copying the struct doesn't make sense.
This patch simplifies the code, so if there's no significant perf difference then it seems fine to me. 😃
Updated by ioquatix (Samuel Williams) over 1 year ago
Thanks for your feedback Aaron.
The concern is less about the internal overhead.
I'm sure different OS can optimise for different situations, e.g. in some cases I imagine writev
can be faster than write
if the system call overhead is large.
The advice we received for Linux from Jens Axboe is that write
is always more efficient that writev
. I've been looking at the complexity in io.c
and thinking about ways to reduce it. Jen's comment was enough to make me think we can get away with removing writev
since it adds a lot of complexity.
In our case, our vectored write "public interface" doesn't change, i.e. io.write(a, b, c)
still works and we could at any point opt to use writev
again. The difference in performance is at best 5% in favour of writev
but only in extreme cases and I suspect we can narrow this significantly, i.e. a small change to rb_io_puts
changed the overhead from 5% difference to 2-3% which is barely measurable in real world programs.
From an implementation point of view, the biggest convenience is:
- Removal of 300 lines of optional code for handling
writev
. - The fiber scheduler will probably never support
writev
as it significantly increases the complexity of every piece of the interface with near zero performance improvement. - With the removal of
writev
, we could focus more on buffering internally to avoid multiple calls to write which is better thanwritev
.
I do care about the performance, if there is a big difference, we shouldn't consider this PR, but it also greatly simplifies the code, which I like.
Updated by usa (Usaku NAKAMURA) over 1 year ago
If I remember correctly, writev was introduced for atomic writes, not for performance.
(I am neutral to remove writev.)
Updated by ioquatix (Samuel Williams) over 1 year ago
Thanks Nakamura-san for your feedback.
According to POSIX:
Atomic/non-atomic: A write is atomic if the whole amount written in one operation is not interleaved with data from any other process. This is useful when there are multiple writers sending data to a single reader. Applications need to know how large a write request can be expected to be performed atomically. This maximum is called {PIPE_BUF}. This volume of IEEE Std 1003.1-2001 does not say whether write requests for more than {PIPE_BUF} bytes are atomic, but requires that writes of {PIPE_BUF} or fewer bytes shall be atomic.
I assume since writev
is defined in terms of write
, that writev
of total size <= PIPE_BUF
will be atomic. It is not defined what happens when the size is bigger than PIPE_BUF
, but we can assume it is non-atomic.
Since we already have a write lock in IO
, writes should be atomic at the Ruby IO level. But not at the kernel level and not between processes that share the same pipes.
irb(main):001:0> $stderr.sync
=> true
irb(main):002:0> $stdout.sync
=> true
It looks like $stdout
and $stderr
are both buffered internally.
I think Ruby should guarantee buffered writes will be atomic up to PIPE_BUF.
Unbuffered writes should have no atomicity guarantees.
If it's not documented, maybe we can consider adding that.
Updated by Anonymous over 1 year ago
"ioquatix (Samuel Williams) via ruby-core" ruby-core@ml.ruby-lang.org wrote:
irb(main):001:0> $stderr.sync => true irb(main):002:0> $stdout.sync => true
It looks like
$stdout
and$stderr
are both buffered internally.
IO#sync==true means unbuffered.
I think Ruby should guarantee buffered writes will be atomic up to PIPE_BUF.
PIPE_BUF is only relevant for pipes/FIFOs (not regular files, sockets, etc..).
Unbuffered writes should have no atomicity guarantees.
Each write/writev syscall to regular files is atomic with O_APPEND.
It's how multi-process servers can write to the same log file
without interleaving lines.
The limit is not PIPE_BUF, but (IIRC) roughly INT_MAX/SSIZE_MAX;
It may be FS-dependent, but it's large enough to not matter on
reasonable local FSes.
Side notes:
Also, any writev benchmarks you'd do should account for the
common pattern of giant bodies prefixed with a small header.
(e.g. HTTP/1.1 chunking, tnetstrings(*), etc).
The 128 bytes you use in the benchmark is tiny and unrealistic.
writev has high overhead with many small chunks (each iovec is
16 bytes), and most I/O size is far larger than 128 bytes.
I understand that writev is slow for the kernel, but (alloc +
memcpy + GC/free) w/ giant strings is slow in userspace, too.
Stuff like:
io.write("#{buf.bytesize.to_s(16)}\r\n", buf, -"\r\n")
When buf is a gigantic string (multiple KB or MB) is where I
expect writev to avoid copies and garbage.
Adding a buffer_offset (tentative name) arg to write_nonblock
would make retrying non-blocking I/O much easier:
n = 0
tot = ary.map(&:bytesize).sum
while n < tot
case w = io.write_nonblock(*ary, buffer_offset: n, exception: false)
when :wait_readable, :wait_writable
break # or schedule or whatever
else
n += w
end
end
Perl's syswrite/read/sysread ops have the OFFSET arg for decades.
Updated by hsbt (Hiroshi SHIBATA) 9 months ago
- Status changed from Open to Assigned
Updated by ioquatix (Samuel Williams) about 1 month ago
- Status changed from Assigned to Closed
I am no longer planning to do this.