Project

General

Profile

Actions

Bug #18417

closed

IO::Buffer problems

Added by zverok (Victor Shepelev) about 3 years ago. Updated about 3 years ago.

Status:
Closed
Target version:
-
[ruby-core:106730]

Description

Hello. While working for documentation for IO::Buffer (still WIP), I uncovered these problems. I believe at least some of them should be fixed before the release, though I understand not much time left.

The list is not final, probably (still on it, but wanted to raise awareness ASAP).

These should be probably fixed/discussed before the release

1. I believe to_str is an unfortunate name for "read string from buffer, maybe entire buffer. It is Ruby's convention that to_str is an implicit String-conversion, and a) I am not sure the Buffer is "string-like" enough to be implicitly converted, and b) it is really unexpected that #to_s and #to_str do different things:

buf = IO::Buffer.for('test')

"buf: #{buf}"
# => "buf: #<IO::Buffer 0x00007fdb7f4e3a20+4 SLICE>"
"buf: " + buf
# => "buf: test"

puts buf
# #<IO::Buffer 0x00007fdb7f4e3a20+4 SLICE>
puts '' + buf
# test

Another concern about #to_str naming is it is one of the main parts of the buffer interface (alongside #copy), but doesn't look this way, and looks like some utility method instead. Maybe copy/to_str pair should be symmetrical, like write/read, or something like that?

2. I am not sure that type names (:U8, :S8, :U16, :u16 and so on) introduced "from scratch" (?) is a good thing. From one point, they are kinda nice, and consistent. From another, we already have abbreviated names for the types in Array#pack, and having two completely unrelated sets of designation for the same thing in the core of the language seems weird to me. I understand that not all #packs designations can be represented by Symbol, and their consistency is questionable (it is S for 16-bit unsigned LE, s for 16-bit signed LE, and S>/s> for the same in BE), but the discrepancy is questionable too.

3. About flags as a creation parameter:

  • ::new: it seems that the only acceptable value is MAPPED; of others, INTERNAL is the default, EXTERNAL, IMMUTABLE and LOCKED raise on creation;
  • ::for: no flags accepted (though it seems that a method to make such buffer immutable might be of some utility?)
  • ::map: INTERNAL, LOCKED, EXTERNAL are ignored, MAPPED is the default; seems like the only usage of the parameter is not to pass IMMUTABLE which is the default: and for that, you'll need to also pass size and offset.

It seems, that we shouldn't probably expose "flags" concept to the user at all, and instead make APIs like ::new(size, mapped: false) and ::map(io, ..., immutable: true)?

Probably bugs

  • It seems that #clear behaves weirdly when offset is provided but length is not:
buf = IO::Buffer.new(4)
buf.clear(1, 2) # it seems obvious that I want to clear from byte 2 to the end of the buffer
# in `clear': Offset + length out of bounds! (RuntimeError)
  • IO::Buffer.map('README.md') (just a string, I experimented "whether it is smart enough") core dumps.
  • I suspect we might want to set buffer to IMMUTABLE if it is created from the frozen string. Currently, this works:
str = 'test'.freeze
buf = IO::Buffer.for(str)
buf.set(:U8, 1, 111)
str
#=> "tost"
  • It seems to_str always returns a string in US-ASCII encoding, I am not sure it is the right behavior. For a low-level buffer, I'd rather expect ASCII-8BIT (what file read in binary mode produces in Ruby).
  • (Not sure if a bug or just not very useful behavior?) #resize applied to a buffer mapped from file seems to "break" the link between the buffer and file; no further copy or set would be reflected on file.
  • buf = IO::Buffer.new(4); buf.free; buf.inspect gives a core dump
  • buf = IO::Buffer.for("test"); buf.free — after that, buf.get(:U8, 0) gives "Buffer has been invalidated! (RuntimeError)", but buf.to_str works and produces "\x00\x00\x00\x00"; buf.to_str(2) does a core dump

Inconsistencies and other problems

  • The buffer raises RuntimeError as the only type of the error. I believe that in some places, it should be other standard errors (for example, "Offset + length out of bounds" could be at least ArumentError?..), and in other occurrences, special error classes: for example, "Buffer is locked" would probably be helpful to have as some IO::Buffer::LockedError
  • Current #inspect is kinda nice and useful, but it also might be really annoying when used by Ruby itself. Try IO::Buffer.map(File.open("big file")).unexisting_method, and you'll get REALLY long error message.
  • Can we maybe have for #resize a signature resize(new_size, preserve = self.size)? It seems a good default to allow just resize(123), with preserving everything by default?
Actions #1

Updated by zverok (Victor Shepelev) about 3 years ago

  • Description updated (diff)
Actions #2

Updated by zverok (Victor Shepelev) about 3 years ago

  • Description updated (diff)
Actions #3

Updated by zverok (Victor Shepelev) about 3 years ago

  • Description updated (diff)

Updated by mame (Yusuke Endoh) about 3 years ago

  • Assignee set to ioquatix (Samuel Williams)
Actions #5

Updated by zverok (Victor Shepelev) about 3 years ago

  • Description updated (diff)
Actions #6

Updated by zverok (Victor Shepelev) about 3 years ago

  • Description updated (diff)
Actions #7

Updated by zverok (Victor Shepelev) about 3 years ago

  • Description updated (diff)
Actions #8

Updated by zverok (Victor Shepelev) about 3 years ago

  • Description updated (diff)

Updated by zverok (Victor Shepelev) about 3 years ago

  • Description updated (diff)

(Got the latest master, updated the problems list :))

Actions #10

Updated by zverok (Victor Shepelev) about 3 years ago

  • Description updated (diff)

Updated by ioquatix (Samuel Williams) about 3 years ago

Thanks so much for this awesome list of things we should try to address.

https://github.com/ruby/ruby/pull/5303

We are considering how to_str should work. Implicit conversion is useful but it could also introduce issues.

Another concern about #to_str naming is it is one of the main parts of the buffer interface (alongside #copy), but doesn't look this way, and looks like some utility method instead. Maybe copy/to_str pair should be symmetrical, like write/read, or something like that?

We need to be a little bit careful because string operations are not the primary point, and we have to also consider that such names clash with IO-like behaviour, e.g. we might use buffer.write(io, ...) or buffer.write_to(io, ...)...

Let me first try to address all the bugs, then with the remaining time let's work on features/changes. This interface is marked experimental for a reason and I expect there will be some big changes between 3.1. release and 3.2 release in terms of the user facing interface so while I'd prefer we get this all sorted, I'm not too concerned if we don't have all the answers before the release of 3.1.

Updated by ioquatix (Samuel Williams) about 3 years ago

Just for reference, link to the latest draft of the absolutely amazing documentation: https://github.com/ruby/ruby/pull/5302

Updated by ioquatix (Samuel Williams) about 3 years ago

(2) I am not sure that type names (:U8, :S8, :U16, :u16 and so on) introduced "from scratch" (?) is a good thing. From one point, they are kinda nice, and consistent. From another, we already have abbreviated names for the types in Array#pack, and having two completely unrelated sets of designation for the same thing in the core of the language seems weird to me. I understand that not all #packs designations can be represented by Symbol, and their consistency is questionable (it is S for 16-bit unsigned LE, s for 16-bit signed LE, and S>/s> for the same in BE), but the discrepancy is questionable too.

These type names are based on a common convention which I've seen is used across different languages/environments. Linux kernel follows the exact convention to use __s32 and __u32 which is where these names came from, since I'm used to seeing them frequently in source code for precise bit widths and signed/unsigned. Rust has some similar conventions.pack and unpack are not obvious and I end up having to look up the documentation every time.

It's also entirely possible we will introduce packed formats, e.g. :u1, etc for efficient unpacking of data, so specifying the exact bit width which corresponds to say, an RFC's definition for a binary structure.

# +---------------+
# |Pad Length? (8)|
# +-+-------------+-----------------------------------------------+
# |E|                 Stream Dependency? (31)                     |
# +-+-------------+-----------------------------------------------+
# |  Weight? (8)  |
# +-+-------------+-----------------------------------------------+
# |                   Header Block Fragment (*)                 ...
# +---------------------------------------------------------------+
# |                           Padding (*)                       ...
# +---------------------------------------------------------------+

FrameHeader = IO::Buffer::Packed[
  pad: :u8,
  exclusive: :u1,
  stream_dependency: :u31,
  weight: :u8,
]

Hopefully this gives you some more insights into what I'm thinking and then we can see if it still makes sense.

Updated by zverok (Victor Shepelev) about 3 years ago

We need to be a little bit careful because string operations are not the primary point

Interesting! Because I started to document most of the usage from the point of view of some code mostly using get and set, but then I looked into your examples of its intended here, and for what I can get from there, it is:

  • uring-based C example uses rb_io_buffer_get_mutable/rb_io_buffer_get_immutable to have a direct access to a memory represented by the buffer
  • epoll-based C example does the same
  • select-based Ruby example uses copy/to_str—that's where I got the idea that it is the main interface (at least from Ruby PoV)

As rb_io_buffer_get_mutable/rb_io_buffer_get_immutable aren't exposed on Ruby level at all (IDK, maybe we do want to mention the "additional" API available for C extensions? It is kinda important), that leaves the copy/to_str` as the "main" ones at this level.

we have to also consider that such names clash with IO-like behaviour

I agree read/write are not the best choice of words probably, but I still believe the choice should be symmetrical... Maybe something "dumb" (acceptable for rarely used low-level APIs?) like get_chunk and put_chunk, or something?..

These type names are based on a common convention which I've seen is used across different languages/environments.

I agree these names are better! I am just spotting a discrepancy and reporting it, if it is considered OK by core devs (which I am not even one of :)), so be it. I don't have a useful proposal here, because changing Array#pack convention is impossible, and using that old convention for IO::Buffer is not convenient either.

Updated by ioquatix (Samuel Williams) about 3 years ago

Yes that all makes sense.

After spending some time today working on it, I decided I don't like the mutable/immutable terminology. The problem is, even if the buffer is "immutable", the underlying data can still change. This is the unfortunate reality of how C/von Neumann computer works.

So, I'm going to rename it to readonly which In think better conveys the meaning of "A view of a buffer which cannot be used to modify it."

Hopefully this will be merged in the linked PR shortly.

Updated by ioquatix (Samuel Williams) about 3 years ago

  • Status changed from Open to Closed

All the issues here have been addressed, except for:

(1) get_value / set_value type names have not changed, but the method name has changed.
(2) Using flags to indicate allocation strategy. I will review it and document the expected behaviour from Ruby.

Because this interface is marked as experimental, I'm not too concerned about either.

Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0