Feature #13626

Add String#byteslice!

Added by ioquatix (Samuel Williams) over 2 years ago. Updated about 1 year ago.

Status: Open
Priority: Normal
Assignee: -
Target version: -
[ruby-core:81544]

Description

It's a common pattern in IO buffering to read part of a string while leaving the remainder.

# Consume only part of the read buffer:
result = @read_buffer.byteslice(0, size)
@read_buffer = @read_buffer.byteslice(size, @read_buffer.bytesize)

It would be nice if this code could be simplified to:

result = @read_buffer.byteslice!(size)

Additionally, this allows a significantly improved implementation by the interpreter.
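
For illustration, here is a minimal pure-Ruby sketch of the proposed semantics, assuming the single-argument form used in the example above (a native implementation inside the interpreter could avoid the intermediate copies):

class String
  def byteslice!(size)
    result = byteslice(0, size)                              # leading bytes to return
    remainder = byteslice(size, bytesize) || byteslice(0, 0) # everything after them
    replace(remainder)                                       # mutate the receiver in place
    result
  end
end

buffer = "hello world".b
buffer.byteslice!(5) # => "hello"
buffer               # => " world"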

History

Updated by normalperson (Eric Wong) over 2 years ago

samuel@oriontransfer.org wrote:

https://bugs.ruby-lang.org/issues/13626

I used to want this, too; but then I realized IO#read and
similar methods will always return a binary string when given a
length limit.

So String#slice! should be enough.

(And IO#read and friends without a length limit is suicidal, anyways :)
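
To illustrate the point above: for a string in binary (ASCII-8BIT) encoding, character offsets and byte offsets coincide, so String#slice! already consumes bytes in place (read_buffer here is just an illustrative name):

read_buffer = "\xCE\xB1rest".b
read_buffer.slice!(0, 2) # => "\xCE\xB1" (the first two bytes)
read_buffer              # => "rest"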

Updated by ioquatix (Samuel Williams) over 2 years ago

Thanks for that idea.

If that's the case, when appending to the write buffer:

write_buffer = String.new.b
unicode_string = "\u1234".force_encoding("UTF-8")
write_buffer << unicode_string
write_buffer.encoding # Changed from ASCII-8BIT to Encoding:UTF-8

The only way I can think of to fix this is to run +force_encoding+ on the write buffer after every append, but this seems hugely inefficient.

Ideas?

Updated by normalperson (Eric Wong) over 2 years ago

samuel@oriontransfer.org wrote:

Thanks for that idea.

If that's the case, when appending to the write buffer:

write_buffer = String.new.b
unicode_string = "\u1234".force_encoding("UTF-8")
write_buffer << unicode_string
write_buffer.encoding # Changed from ASCII-8BIT to Encoding:UTF-8

The only way I can think of to fix this is to run +force_encoding+ on the write buffer after every append, but this seems hugely inefficient.

Ideas?

String#force_encoding is done in-place, so it should not be
that slow; the String#<< would be the slow part, since it
involves at least one memcpy (worst case is realloc + 2 memcpy).

But I'm not sure why you would want to be setting data to
UTF-8; I guess you got it from some 3rd-party library?

Maybe String#b! could be a shorter alias for
force_encoding(Encoding::ASCII_8BIT); but yeah, exposing writev via
[Feature #9323] is probably the best option, anyways.

Fwiw, I'm also not convinced that String#<< changing
write_buffer to Encoding::UTF-8 in your example above is good
behavior on Ruby's part... But I don't know much about human
language encodings; I am just a *nix plumber where a byte is a
byte.

Updated by ioquatix (Samuel Williams) over 2 years ago

Fwiw, I'm also not convinced that String#<< changing
write_buffer to Encoding::UTF-8 in your example above is good
behavior on Ruby's part...

Agreed.

Updated by matz (Yukihiro Matsumoto) almost 2 years ago

Sounds OK to me.

Matz.

Updated by akr (Akira Tanaka) almost 2 years ago

At the developer meeting, we discussed that byteslice! and byteslice should take the same arguments.

Updated by duerst (Martin Dürst) almost 2 years ago

normalperson (Eric Wong) wrote:

Fwiw, I'm also not convinced that String#<< changing
write_buffer to Encoding::UTF-8 in your example above is good
behavior on Ruby's part... But I don't know much about human
language encodings; I am just a *nix plumber where a byte is a
byte.

This behavior may not be the best for this specific case, but in general, if one string is US-ASCII and the other is UTF-8, then, since UTF-8 is a superset of US-ASCII, concatenating the two will produce a string in UTF-8. Dropping the encoding would lose important information.

Please also note that you are actually on dangerous ground here. The above only works because the string doesn't contain any non-ASCII (high bit set) bytes. As soon as there is such a byte, there will be an error.

s = "abcde".b
s.encoding   # => #<Encoding:ASCII-8BIT>
s << "αβγδε" # => "abcdeαβγδε"
s.encoding   # => #<Encoding:UTF-8>

but:

t = "αβγδε".b # => "\xCE\xB1\xCE\xB2\xCE\xB3\xCE\xB4\xCE\xB5"
t.encoding    # => #<Encoding:ASCII-8BIT>
t << "λμπρ"   # => Encoding::CompatibilityError: incompatible character encodings: ASCII-8BIT and UTF-8

So if you have an ASCII-8BIT buffer and want to append something, always make sure that what you append is also ASCII-8BIT.
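
As a minimal sketch of that advice (the variable names are just illustrative): convert whatever you append to ASCII-8BIT first, so the buffer never changes encoding and high-bit bytes never raise Encoding::CompatibilityError:

write_buffer = String.new(encoding: Encoding::BINARY)

write_buffer << "αβγδε".b # String#b returns a binary-encoded copy; the original is untouched
write_buffer << "λμπρ".b
write_buffer.encoding     # => #<Encoding:ASCII-8BIT>

write_buffer.force_encoding(Encoding::UTF_8) # => "αβγδελμπρ"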

Updated by ioquatix (Samuel Williams) over 1 year ago

If you round trip UTF-8 to ASCII-8BIT and back again, the result should be the same IMHO. It's just the interpretation of the bytes which is different, but the underlying data should be the same. I still think adding String#byteslice! is a good idea. Has there been any progress?

Updated by ioquatix (Samuel Williams) over 1 year ago

By the way, I ended up implementing https://github.com/socketry/async-io/blob/master/lib/async/io/binary_string.rb, which I guess is okay, but it's not ideal.
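
For reference, here is a minimal sketch of one way such a wrapper could look; it illustrates the general approach (force appended data to binary before concatenating) rather than the linked implementation:

class BinaryString < String
  def <<(string)
    super(string.b) # append a binary copy so the buffer's encoding never changes
  end
  # Note: other appending methods (concat, prepend, insert) would need the same treatment.
end

buffer = BinaryString.new
buffer << "αβγδε"
buffer.encoding # => #<Encoding:ASCII-8BIT>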

Updated by janko (Janko Marohnić) about 1 year ago

I support adding String#byteslice!. I've been using String#byteslice in custom IO-like objects that implement IO#read semantics, because the strings I work with don't necessarily have to be in binary encoding (otherwise I'd just use String#slice); they can also be in UTF-8. Since IO#read needs to work in terms of bytes, I needed String#byteslice.

I've used the exact idiom from Samuel's original description in three different projects already.

String#byteslice! would allow reducing that code and would probably end up allocating fewer strings.
