Feature #5677
IO C API
Description
This is related to the proposal in [ruby-core:41321][1].
I'd like to take advantage of streaming IO in an extension I am
working on. The problem I'm having is that I don't want to call
IO#read on the rb_funcall level because that would kill the
performance due to wrapping the bytes into Ruby objects back and
forth again.
I saw two solutions to my problem:
1. Duplicating the file descriptor to obtain a pure FILE*
   like it is done in ext/openssl/ossl_bio.c[2] and continue
   working on the raw FILE*.
2. Since I really only need to read and write on the stream,
   I was looking for public Ruby C API that would support me
   in the process, and I found
   - ssize_t rb_io_bufwrite(VALUE io, const void *buf, size_t size)
   - ssize_t rb_io_bufread(VALUE io, void *buf, size_t size)
I think both cases are valid use cases, 1. is likely necessary
if there is the need to pass a FILE* on to an external C library,
2. is for cases like mine where there is the need to operate
on raw C data types for performance reasons.
The problem, though, is that only rb_io_bufwrite is public API in io.h,
rb_io_bufread is declared private in internal.h and rb_cloexec_dup is
semi-public in intern.h.
Could we make rb_io_bufread public API in io.h as well? What about
rb_cloexec_dup?
[1] http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/41321
[2] https://github.com/ruby/ruby/blob/trunk/ext/openssl/ossl_bio.c#L17
Updated by normalperson (Eric Wong) almost 13 years ago
Martin Bosslet Martin.Bosslet@googlemail.com wrote:
This is related to the proposal in [ruby-core:41321][1].
I'd like to take advantage of streaming IO in an extension I am
working on. The problem I'm having is that I don't want to call
IO#read on the rb_funcall level because that would kill the
performance due to wrapping the bytes into Ruby objects back and
forth again.
Is starting with Ruby String objects (with binary encoding) and then
having read(2)/write(2) hit RSTRING_PTR not possible?
I saw two solutions to my problem:
- Duplicating the file descriptor to obtain a pure FILE*
like it is done in ext/openssl/ossl_bio.c[2] and continue
working on the raw FILE*.
That may be from the old 1.8 days when all IO objects wrapped FILE *.
It might be better to use BIO_new_fd() nowadays instead since 1.9
generally prefers bare file descriptors (for all fd > 2).
- Since I really only need to read and write on the stream,
I was looking for public Ruby C API that would support me
in the process, and I found
- ssize_t rb_io_bufwrite(VALUE io, const void *buf, size_t size)
- ssize_t rb_io_bufread(VALUE io, void *buf, size_t size)
Is userspace buffering really necessary in your case?
If you're working with sockets/pipes, I would reckon not (Ruby already
defaults to IO#sync=false on sockets/pipes when writing). If you're
reading (and probably parsing), you would need to do your own read
buffering anyways, no?
I think both cases are valid use cases, 1. is likely necessary
if there is the need to pass a FILE* on to an external C library,
It's not easily possible to share userspace buffers in FILE * with
userspace buffers in rb_io_t. Userspace buffering is pretty miserable
and error-prone whenever/wherever IPC is concerned.
2. is for cases like mine where there is the need to operate
on raw C data types for performance reasons.
It depends on what you're doing, but if performance is a concern you
should try to work on largish chunks off the file descriptor and
skip the userspace buffering stages. Userspace buffering can improve
performance by reducing syscalls, but it can also double the memory
bandwidth required to do things.
Updated by MartinBosslet (Martin Bosslet) almost 13 years ago
Eric Wong wrote:
First off, thanks for your comments.
Martin Bosslet Martin.Bosslet@googlemail.com wrote:
This is related to the proposal in [ruby-core:41321][1].
I'd like to take advantage of streaming IO in an extension I am
working on. The problem I'm having is that I don't want to call
IO#read on the rb_funcall level because that would kill the
performance due to wrapping the bytes into Ruby objects back and
forth again.
Is starting with Ruby String objects (with binary encoding) and then
having read(2)/write(2) hit RSTRING_PTR not possible?
You mean reading String chunks from the underlying IO? I'm afraid not.
The only way I could do that right now is by calling the Ruby methods for
IO#read/write using rb_funcall. But there's a lot of overhead involved,
VM roundtrip plus lots of short-lived objects that trigger GC. It would
likely end up being slower than the current ASN1.decode, a situation I'd
like to avoid.
I saw two solutions to my problem:
- Duplicating the file descriptor to obtain a pure FILE*
like it is done in ext/openssl/ossl_bio.c[2] and continue
working on the raw FILE*.
That may be from the old 1.8 days when all IO objects wrapped FILE *.
It might be better to use BIO_new_fd() nowadays instead since 1.9
generally prefers bare file descriptors (for all fd > 2).
Good point, I will look into using it instead.
- Since I really only need to read and write on the stream,
I was looking for public Ruby C API that would support me
in the process, and I found
- ssize_t rb_io_bufwrite(VALUE io, const void *buf, size_t size)
- ssize_t rb_io_bufread(VALUE io, void *buf, size_t size)
Is userspace buffering really necessary in your case?
No, not really, but currently it's the only way the C API allows
me to do C-level streaming on an IO.
If you're working with sockets/pipes, I would reckon not (Ruby already
defaults to IO#sync=false on sockets/pipes when writing). If you're
reading (and probably parsing), you would need to do your own read
buffering anyways, no?
see below
I think both cases are valid use cases, 1. is likely necessary
if there is the need to pass a FILE* on to an external C library,
It's not easily possible to share userspace buffers in FILE * with
userspace buffers in rb_io_t. Userspace buffering is pretty miserable
and error-prone whenever/wherever IPC is concerned.
2. is for cases like mine where there is the need to operate
on raw C data types for performance reasons.
It depends on what you're doing, but if performance is a concern you
should try to work on largish chunks off the file descriptor and
skip the userspace buffering stages. Userspace buffering can improve
performance by reducing syscalls, but it can also double the memory
bandwidth required to do things.
Yes, I would have to do my own buffering during parsing in any case, so
double buffering means an unnecessary waste of memory. I guess making a clean
cut and working on the file descriptor directly seems like the best solution.
Still, I am wondering if there is the need for a low-level C API for doing
IO on Ruby IO objects, or is the "clean cut approach" using the file descriptor
directly the recommended solution in any case?
Updated by normalperson (Eric Wong) almost 13 years ago
Martin Bosslet Martin.Bosslet@googlemail.com wrote:
Eric Wong wrote:
Martin Bosslet Martin.Bosslet@googlemail.com wrote:
This is related to the proposal in [ruby-core:41321][1].
I'd like to take advantage of streaming IO in an extension I am
working on. The problem I'm having is that I don't want to call
IO#read on the rb_funcall level because that would kill the
performance due to wrapping the bytes into Ruby objects back and
forth again.
Is starting with Ruby String objects (with binary encoding) and then
having read(2)/write(2) hit RSTRING_PTR not possible?
You mean reading String chunks from the underlying IO? I'm afraid not.
The only way I could do that right now is by calling the Ruby methods for
IO#read/write using rb_funcall. But there's a lot of overhead involved,
VM roundtrip plus lots of short-lived objects that trigger GC. It would
likely end up being slower than the current ASN1.decode, a situation I'd
like to avoid.
You can avoid short-lived objects by passing Strings as the second
argument to IO#read-like methods:
    buf = ""
    while r.read(16384, buf)
      w.write(buf)
    end
Without GC calls happening, I don't expect significant overhead from the VM.
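The buffer-reuse idiom above can be tried out with a minimal sketch like the following. The method name `copy_with_reused_buffer`, the 4096-byte chunk size, and the use of a pipe and a StringIO as endpoints are assumptions made for illustration only; they are not part of the thread.

```ruby
require "stringio"

# Copy everything from src to dst while reading into one preallocated String.
# (Hypothetical helper name, used only for this sketch.)
def copy_with_reused_buffer(src, dst, chunk_size = 16_384)
  buf = String.new(capacity: chunk_size)  # allocated once, binary encoding
  copied = 0
  while src.read(chunk_size, buf)         # nil at EOF; buf is overwritten in place
    copied += dst.write(buf)
  end
  copied
end

r, w = IO.pipe
w.write("x" * 10_000)   # small enough to fit into the pipe without a reader
w.close
sink = StringIO.new
total = copy_with_reused_buffer(r, sink, 4096)
r.close
puts "copied #{total} bytes"
```

Each loop iteration still goes through normal method dispatch; what the idiom avoids is allocating a fresh String per chunk.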
If you're working with sockets/pipes, I would reckon not (Ruby already
defaults to IO#sync=false on sockets/pipes when writing).
Err, typo on my part, IO#sync=true is the default (meaning no userspace
buffering is the default).
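The corrected statement about defaults can be checked from Ruby itself. This is only a small sketch: the helper name `default_sync_modes` is made up, and it assumes a POSIX platform where `UNIXSocket.pair` is available.

```ruby
require "socket"
require "tempfile"

# Report IO#sync for a pipe's write end, a UNIX socket, and a regular file.
# (Hypothetical helper, for illustration only.)
def default_sync_modes
  modes = {}

  r, w = IO.pipe
  modes[:pipe_writer] = w.sync
  [r, w].each(&:close)

  a, b = UNIXSocket.pair
  modes[:socket] = a.sync
  [a, b].each(&:close)

  # Regular files are the contrasting case: writes are buffered by default.
  Tempfile.create("sync-demo") { |f| modes[:file] = f.sync }
  modes
end

p default_sync_modes
```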
Still, I am wondering if there is the need for a low-level C API for doing
IO on Ruby IO objects, or is the "clean cut approach" using the file descriptor
directly the recommended solution in any case?
I recommend directly working off the file descriptor. The only tricky
part is making sure there's nothing in the userspace buffers beforehand.
You'd probably need to call IO#rewind/IO#seek to sync read buffers up
and IO#flush (and then IO#sync=true) for write buffers.
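The hand-over to the raw descriptor might look roughly like this at the Ruby level. It is a sketch under assumptions: `IO#syswrite` and `IO#sysread` stand in for the `write(2)`/`read(2)` calls a C extension would make on `io.fileno`, and both method names below are hypothetical.

```ruby
require "tempfile"

# Write side: flush Ruby's buffer and switch to sync mode before writing raw.
def raw_write_after_flush
  Tempfile.create("raw-write") do |f|
    f.write("buffered ")   # may still sit in Ruby's userspace write buffer
    f.flush                # push it to the kernel first
    f.sync = true          # keep later IO#write calls unbuffered
    f.syswrite("raw")      # stand-in for write(2) on f.fileno
    File.read(f.path)
  end
end

# Read side: a buffered read leaves data in Ruby's read buffer. Seeking
# discards it and moves the descriptor's offset to the logical position.
def raw_read_after_seek
  Tempfile.create("raw-read") do |f|
    f.write("line1\nline2\n")
    f.rewind
    f.gets                    # reads ahead into the userspace buffer
    f.seek(0, IO::SEEK_CUR)   # resync the descriptor with the logical position
    f.sysread(64)             # stand-in for read(2) on f.fileno
  end
end

p raw_write_after_flush
p raw_read_after_seek
```

Seeking only works on seekable streams such as regular files; pipes and sockets cannot be resynchronized this way.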
Updated by nobu (Nobuyoshi Nakada) over 12 years ago
- Status changed from Open to Assigned
- Assignee set to akr (Akira Tanaka)
MartinBosslet (Martin Bosslet) wrote:
- Duplicating the file descriptor to obtain a pure FILE*
like it is done in ext/openssl/ossl_bio.c[2] and continue
working on the raw FILE*.
Can't you use rb_io_stdio_file()?
And OpenSSL seems to provide BIO_new_fd() too.
The problem, though, is that only rb_io_bufwrite is public API in io.h,
rb_io_bufread is declared private in internal.h and rb_cloexec_dup is
semi-public in intern.h.
Could we make rb_io_bufread public API in io.h as well? What about
rb_cloexec_dup?
It doesn't seem bad to me.
They were all added to internal.h by akr.
Updated by MartinBosslet (Martin Bosslet) over 12 years ago
nobu (Nobuyoshi Nakada) wrote:
Can't you use rb_io_stdio_file()?
And OpenSSL seems to provide BIO_new_fd() too.
True, and that's also what I should be using there :) I'll fix it.
The problem, though, is that only rb_io_bufwrite is public API in io.h,
rb_io_bufread is declared private in internal.h and rb_cloexec_dup is
semi-public in intern.h.
Could we make rb_io_bufread public API in io.h as well? What about
rb_cloexec_dup?
It doesn't seem bad to me.
They were all added to internal.h by akr.
I've done a lot of IO in C over the last months, now
I have a much clearer picture of it.
I think my only problem was that there is no unified read/write
that allows working on arbitrary IOs efficiently (I could fall
back to rb_funcall, but then I'd give away the performance benefits).
I had to distinguish between IOs based on rb_io_t, StringIO,
and raw Strings. I ended up writing my own
wrapper that would abstract away the differences. It would be
really nice to have such an abstraction directly in the API.
Would this be an option?
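To make the shape of such an abstraction concrete, here is a Ruby-level sketch. The class `ByteSource` and the helper `total_bytes` are hypothetical names, and the wrapper Martin describes was written in C and is not shown in this thread. The sketch relies on ordinary method dispatch, so it illustrates only the interface, not the performance goal.

```ruby
require "stringio"

# Hypothetical adapter giving IO, StringIO, and plain String one read interface.
class ByteSource
  def initialize(source)
    # Raw Strings are wrapped; anything else is assumed to respond to
    # read(length, buffer) the way IO and StringIO do.
    @io = source.is_a?(String) ? StringIO.new(source) : source
    @buf = String.new
  end

  # Yields successive chunks; the yielded String is reused between iterations.
  def each_chunk(size = 16_384)
    yield @buf while @io.read(size, @buf)
  end
end

# Example consumer that never needs to know which kind of source it was given.
def total_bytes(source)
  n = 0
  ByteSource.new(source).each_chunk(4) { |chunk| n += chunk.bytesize }
  n
end

puts total_bytes("hello world")
puts total_bytes(StringIO.new("abc"))
```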
Updated by akr (Akira Tanaka) about 12 years ago
- Target version changed from 2.0.0 to 2.6
Currently Ruby doesn't provide an abstraction layer dedicated to IO and IO-like classes.
Or, in other words, Ruby provides such a layer through method dispatch (OOP polymorphism).
I guess it's difficult to introduce another dispatch mechanism for Ruby.
Making rb_io_bufread and rb_cloexec_dup public API is much easier.
Updated by MartinBosslet (Martin Bosslet) about 12 years ago
In hindsight, my major concern with the current IO C API is that it is impossible to optimize anything for StringIO. Since StringIO is not part of core, a C extension must typically fall back to calling methods of the Ruby IO API, which is not very efficient.
It would be nice to abstract the implementation details away, calling C API methods that take care of the details and efficiency - but I agree, that would be a lot of work. But still, is a revised IO API possible in the future? I think IO-heavy C extensions would almost certainly benefit from it.
Updated by naruse (Yui NARUSE) about 12 years ago
I heard before that yugui was planning (or something like that) such an abstract implementation.
How's it going? > yugui
Updated by yugui (Yuki Sonoda) almost 12 years ago
The target of the project I am working on is to provide an easy way to write an IO-compatible class in Ruby.
The project does not aim to achieve high performance, so what Martin wants sounds a little different from my project.
http://github.com/yugui/ioable
Updated by akr (Akira Tanaka) over 11 years ago
- Status changed from Assigned to Feedback
It seems no one is seriously designing or implementing such an IO framework.
Updated by akr (Akira Tanaka) over 10 years ago
I am rejecting this issue because no one has implemented it so far.
It is too difficult to discuss concretely without an implementation.
Updated by akr (Akira Tanaka) over 10 years ago
- Status changed from Feedback to Rejected