Project

General

Profile

Actions

Feature #20594

closed

A new String method to append bytes while preserving encoding

Added by byroot (Jean Boussier) 7 months ago. Updated 4 months ago.

Status:
Closed
Target version:
-
[ruby-core:118388]

Description

Context

When working with binary protocols such as protobuf or MessagePack, you may often need to assemble multiple
strings of different encoding:

Post = Struct.new(:title, :body) do
  def serialize(buf)
    buf <<
      255 << title.bytesize << title <<
      255 << body.bytesize << body
  end
end

Post.new("Hello", "World").serialize("somedata".b) # => "somedata\xFF\x05Hello\xFF\x05World" #<Encoding:ASCII-8BIT>

The problem in the above case, is that because Encoding::ASCII_8BIT is declared as ASCII compatible,
if one of the appended string contains bytes outside the ASCII range, string is automatically promoted
to another encoding, which then leads to encoding issues:

Post.new("H€llo", "Wôrld").serialize("somedata".b) # => incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)

In many cases, you want to append to a String without changing the receiver's encoding.

The issue isn't exclusive to binary protocols and formats, it also happen with ASCII protocols that accept arbitrary bytes inline,
like Redis's RESP protocol or even HTTP/1.1.

Previous discussion

There was a similar feature request a while ago, but it was abandoned: https://bugs.ruby-lang.org/issues/14975

Existing solutions

You can of course always cast the strings you append to avoid this problem:

Post = Struct.new(:title, :body) do
  def serialize(buf)
    buf <<
      255 << title.bytesize << title.b <<
      255 << body.bytesize << body.b
  end
end

But this cause a lot of needless allocations.

You'd think you could also use bytesplice, but it actually has the same issue:

Post = Struct.new(:title, :body) do
  def serialize(buf)
    buf << 255 << title.bytesize
    buf.bytesplice(buf.bytesize, title.bytesize, title)
    buf << 255 << body.bytesize
    buf.bytesplice(buf.bytesize, body.bytesize, title)
  end
end
Post.new("H€llo", "Wôrld").serialize("somedata".b) # => 'String#bytesplice': incompatible character encodings: BINARY (ASCII-8BIT) and UTF-8 (Encoding::CompatibilityError)

And even if it worked, it would be very unergonomic.

Proposal: a byteconcat method

A solution to this would be to add a new byteconcat method, that could be shimed as:

class String
  def byteconcat(*strings)
    strings.map! do |s|
      if s.is_a?(String) && s.encoding != encoding
        s.dup.force_encoding(encoding)
      else
        s
      end
    end
    concat(*strings)
  end
end

Post = Struct.new(:title, :body) do
  def serialize(buf)
    buf.byteconcat(
      255, title.bytesize, title,
      255, body.bytesize, body,
    )
  end
end

Post.new("H€llo", "Wôrld").serialize("somedata".b) # => "somedata\xFF\aH\xE2\x82\xACllo\xFF\x06W\xC3\xB4rld" #<Encoding:ASCII-8BIT>

But of course a builtin implementation wouldn't need to dup the arguments.

Like other byte* methods, it's the responsibility of the caller to ensure the resulting string has a valid encoding, or
to deal with it if not.

Method name and signature

Name

This proposal suggests String#byteconcat, to mirror String#concat, but other names are possible:

  • byteappend (like Array#append)
  • bytepush (like Array#push)

Signature

This proposal makes byteconcat accept either String or Integer (in char range) arguments like concat. I believe it makes sense for consistency and also because it's not uncommon for protocols to have some byte based segments, and Integers are more convenient there.

The proposed method also accept variable arguments for consistency with String#concat, Array#push, Array#append.

The proposed method returns self, like concat and others.

YJIT consideration

I consulted @maximecb (Maxime Chevalier-Boisvert) about this proposal, and according to her, accepting variable arguments makes it harder for YJIT to optimize.
I suspect consistency with other APIs trumps the performance consideration, but I think it's worth mentioning.


Related issues 2 (0 open2 closed)

Related to Ruby master - Feature #14975: String#append without changing receiver's encodingRejectedioquatix (Samuel Williams)Actions
Related to Ruby master - Bug #15460: Behaviour of String#setbyte changedClosedshyouhei (Shyouhei Urabe)Actions
Actions #1

Updated by byroot (Jean Boussier) 7 months ago

  • Related to Feature #14975: String#append without changing receiver's encoding added

Updated by Eregon (Benoit Daloze) 7 months ago

+1 from me.

I forgot that String#concat accepted integers.
Together with accepting variable arguments, it feels like this new method would be quite a bit complicated to implement and optimize well.
But, it's consistent with String#concat so I think it makes sense as proposed.

Updated by maximecb (Maxime Chevalier-Boisvert) 6 months ago

I consulted @maximecb (Maxime Chevalier-Boisvert) (Maxime Chevalier-Boisvert) about this proposal, and according to her, accepting variable arguments makes it harder for YJIT to optimize.

Yeah my personal preference would be for something like bytepush or byteappend that accepts a single argument, the simplest possible method, that is kept as strict as possible, while still getting the job done.

The technical reason being that in order to handle multiple arguments, we have to essentially unroll the loop in the YJIT codegen, and speculate that the arguments will all keep the same type at run-time. This is likely to be true, but it means that we have to speculate on many things at once and then generate a big piece of machine code all at the same time.

I understand that there is some temptation to keep the API more similar to other String methods with a similar purpose, but IMO this method is already unusual because it's essentially an unsafe operation (no encoding validation). The reason we're discussing this method in the first place is performance, and so maybe performance/simplicity should be the main concern.

If the only way this change will be allowed to happen is to allow a variable argument count, then OK I guess, but I would like to push for something more simple, less dynamic. We sort of have the benefit of hindsight here, but take the Ruby binding API for example. If we were designing Ruby today, there is no way we would choose to make binding as powerful and unrestricted as it is. It's much easier to remove restrictions later than to add them back in.

Updated by matz (Yukihiro Matsumoto) 6 months ago

  • Assignee set to byroot (Jean Boussier)

I agree with having such a method. However, I disagree with the name byteconcat. Since other methods with byte prefixes have the behavior of counting the index in bytes, but this method has nothing to do with the index, but with the encoding.
If this method only works with BINARY encoding, the name might be binary_concat or binconcat. In the developers' meeting, someone proposed force_concat.

Matz.

Updated by byroot (Jean Boussier) 6 months ago

Since other methods with byte prefixes have the behavior of counting the index in bytes, but this method has nothing to do with the index

I kinda disagree here. My understanding is that the byte* methods operate on bytes rather than characters, so size != bytesize.

Here the idea is to concat bytes, not characters so I think it fits.

If this method only works with BINARY encoding,

It doesn't no. It can work for any encoding, you can use it for instance to assemble an UTF-8 string from some network reads:

buf = +"" # UTF-8
while chunk = io.read(1024) # ASCII-BIT
  buf.byteconcat(chunk)
  buf.valid_encoding? # May be true or false 
end
buf

Updated by byroot (Jean Boussier) 6 months ago

Discussed in person in the meeting. I'll think about other names and propose some.

Updated by mame (Yusuke Endoh) 6 months ago

Existing methods with byte-prefix (String#byteindex, #bytesplite, etc.) mean that the unit of offset or size is in byte.

byteconcat and byteappend, on the other hand, are confusing because they have no offset or size, but they mean concatenation without regard to encodings.

There were some alternative proposals that came out of the dev meeting.

  • force_concat (matz: it lacks "encoding", I don't like it so much)
  • binary_concat (It should work only when the receiver's encoding is BINARY. Does it fit with @byroot (Jean Boussier) 's motivation?)

Updated by byroot (Jean Boussier) 6 months ago

binary_concat (It should work only when the receiver's encoding is BINARY. Does it fit with @byroot (Jean Boussier) (Jean Boussier) 's motivation?)

No it's too limiting. It should work with any encoding in my opinion.

I'll try to come up with other names. So far I'm thinking of String#append_bytes(String).

Updated by Eregon (Benoit Daloze) 6 months ago

mame (Yusuke Endoh) wrote in #note-7:

Existing methods with byte-prefix (String#byteindex, #bytesplite, etc.) mean that the unit of offset or size is in byte.

My understanding of byte* methods is they treat the String as a byte array, which implies indices are just byte indices but also that the encoding is ignored (it seems clear when one does "é".getbyte(0)).
It's (almost) as-if the string had the BINARY encoding for the duration of the operation, but without the overhead to switch to BINARY and back (which notably could cause some extra code range computation, etc).
BTW, I would consider each_byte also a byte* method, and that one does not accept or pass byte indices.

So I think it would make sense to extend the meaning of byte* methods to be a little more general, just like I explained above.
I don't think it was documented to be only about byte indices either.

That said, I think String#append_bytes(String) sounds fine too.

Updated by byroot (Jean Boussier) 6 months ago

but also that the encoding is ignored

That was my understanding as well, but given that bytesplice does potentially raise EncodingError, there is a mismatch here.

In this case I very much want to allow breaking a string encoding.

Updated by Eregon (Benoit Daloze) 6 months ago

I wonder if that's a bug of bytesplice.
It's not like e.g. setbyte would respect the encoding (it cannot possibly).
OTOH byteindex also takes a String argument and raises Encoding::CompatibilityError for "é".b.byteindex "é" so I guess it's not so clear-cut currently (I get the replies above a bit better).

Another way to see the new method is concatenation without encoding negotiation/checking encoding compatibility.
Having byte/bytes in the name seems a good way to express that.

Updated by shugo (Shugo Maeda) 6 months ago

Eregon (Benoit Daloze) wrote in #note-11:

I wonder if that's a bug of bytesplice.
It's not like e.g. setbyte would respect the encoding (it cannot possibly).
OTOH byteindex also takes a String argument and raises Encoding::CompatibilityError for "é".b.byteindex "é" so I guess it's not so clear-cut currently (I get the replies above a bit better).

It's intended behavior.
Having a long history of suffering from character encoding issues such as mojibake, we Japanese believe that operations on strings with different encodings should be approached with caution.

byte- methods could be considered exceptions, but at present, they are not. That's why Matz doesn't like the name byteconcat.

Updated by Dan0042 (Daniel DeLorme) 6 months ago

From the dev meeting log these two points stuck out at me:

  • nobu: can we use IO::Buffer?
  • shyouhei: why not StringIO?

These are good questions that I also wondered about, but the answers are not recorded in the meeting log. What is the reason not to use these APIs which are pretty much designed for this exact use case? Why introduce a third API?

BTW I have to say I find it extremely ironic that we went to so much trouble to migrate from byte-oriented strings in 1.8 to character-oriented strings in 1.9, and now we're re-adding byte-oriented methods.

Updated by alanwu (Alan Wu) 6 months ago

What is the reason not to use these APIs which are pretty much designed for this exact use case? Why introduce a third API?

Sometimes only String gives you the most convenience and efficiency because of API interoperability. For example, binary protocols often have you reaching for Array#pack and String#unpack, and these don't directly work with StringIO and IO::Buffer.

Updated by byroot (Jean Boussier) 6 months ago

@Dan0042 I already answered your same question in [Feature #20394]. Whether we like it or not neither StringIO nor IO::Buffer are fast, compatible and convenient enough to replace String for these use case.

If you don't believe me, I encourage you to try using either in https://github.com/Shopify/protoboeuf or https://github.com/redis-rb/redis-client (or any other gem of your choice) and see how that goes.

Updated by byroot (Jean Boussier) 5 months ago

Ok, so after thinking about this for a bit, I think a good name would be:

  • String#append_bytes(String) => self
  • String#append_byte(Integer) => self

It's still mentioning bytes, but it's not using the same byte* prefix as other methods, so I think it's different enough. If anything mentioning byte is deemed to confusing I can search or something else, but I really think that's the proper name for the concept.

Updated by duerst (Martin Dürst) 5 months ago

Eregon (Benoit Daloze) wrote in #note-9:

My understanding of byte* methods is they treat the String as a byte array, which implies indices are just byte indices but also that the encoding is ignored (it seems clear when one does "é".getbyte(0)).

This may need a completely separate issue, but when I introduced String#force_encoding, I was imagining adding a block to it so that the forced encoding would only apply inside the block.

s = "your string (e.g. UTF-8 or whatever) here"
s.force_encoding(Encoding::BINARY) {
  # binary operations go here
}
s.encoding # => Encoding:UTF-8

This would streamline things quite a bit; no need for separate byte* methods. Using a block should also help because in many use cases, there's be more than just one byte* method.

The question then would of course be how to optimize things. But to a large part, the optimizations are actually already done, because different encodings use different primitives.

Updated by Dan0042 (Daniel DeLorme) 5 months ago

duerst (Martin Dürst) wrote in #note-17:

This may need a completely separate issue, but when I introduced String#force_encoding, I was imagining adding a block to it so that the forced encoding would only apply inside the block.

+1, that would be very convenient, but yes it's a different issue because it doesn't ignore Encoding::CompatibilityError which is the point of this proposal.

Many languages have a ByteArray or ByteBuffer separate from String, and I believe this is what ruby needs. Design-wise, in the long run, I believe it's better to evolve StringIO or IO::Buffer into a proper bytebuffer to handle encoding-agnostic bytes; and String should only handle encoding-aware characters. (I apologize for beating a dead horse.)

Updated by matz (Yukihiro Matsumoto) 5 months ago

append_bytes seems OK for me. Could you design the concrete behavior of the method:

  • does it take more than one argument?
  • does it take integers too?
  • what is the result of the method encoding-wise?
  • etc.

Matz.

Updated by byroot (Jean Boussier) 5 months ago

Thank you Matz.

I opened a PR that implement the method as envisioned: https://github.com/ruby/ruby/pull/11293

does it take more than one argument?

No, a single argument, and only a String (T_STRING) as suggested by YJIT people to make it optimizable.

does it take integers too?

Similarly, YJIT people suggested to not take different types. Hence why I proposed two methods append_byte(Integer) and append_bytes(String).

what is the result of the method encoding-wise?

The receiver encoding is never changed, even if it means that it's encoding become invalid. It's the called responsability to check String#valid_encoding? if that's a possibility and to deal with it.

It also means append_bytes never raises an Encoding::CompatibilityError.

Updated by alanwu (Alan Wu) 5 months ago

Let me present an alternative design that only adds one method. The name is
String#append_as_bytes, and the name provides a framing of "reinterpretation"
that helps to explain the behavior of the method.

call-seq:
  append_as_bytes(*objects) -> self

Interpret arguments as bytes and append them to +self+ without changing the
encoding of +self+.

For each object that is a String, append the bytes of the string to +self+. For
each Integer object +i+, append a byte that is the bitwise AND of +i+ and
+0xff+. If any other type of objects is in +objects+, leave +self+ unmodified
and raise ArgumentError. This method does not attempt to implicitly convert any
arguments.

Examples:

  7z_signature = ''.b
  7z_signature.append_bytes('7z', 0xbc, 0xaf, 0x27, 0x1c) #=> "7z\xBC\xAF'\x1C"

It's clear from the name that the method has its own interpretation of
arguments. That gives a hint that it does something unusual, as it breaks
away from the default, "bytes with an encoding" stance of strings. It also
grammatically stands out from existing *byte* methods, a reflection of the
differences in behavior.

For string arguments, it's the same as append_bytes(String) from byroot. For
integers, the i & 0xff masking behavior comes from String#setbyte. Note that
it masks without making calls.

  -128 & 0xff                         # => 128
  "x".tap{ _1.setbyte(0, -128)}.bytes # => [128]

This masking is how it interprets an integer as a byte.

The method does not accept arrays for simplicity, as splatting is already
available as a flexible option for callers.

I think this design strikes a good balance between usability, efficiency, and
how well compilers can handle it.

Updated by byroot (Jean Boussier) 5 months ago

The name is String#append_as_bytes, and the name provides a framing of "reinterpretation" that helps to explain the behavior of the method.

I like that name, I think it clears out any possible confusion.

append_as_bytes(*objects)

I retired the *objects from my proposal because I've been told it'd be too hard to optimize, but if you believe it's no problem, I definitely prefer it as it's much more Ruby like, and was my initial proposal.

For integers, the i & 0xff masking behavior comes from String#setbyte.

I agree it makes sense to mirror setbyte rather than String#<<(Integer) here.

Updated by tenderlovemaking (Aaron Patterson) 5 months ago

No opinion on method name / API, but we've verified effectiveness of this optimization using our protobuf implementation and you can see the results here.

Updated by mame (Yusuke Endoh) 5 months ago

Alan's proposal looks good to me. I don't think it is Ruby to design such an ordinary API with extreme care for optimization.

In fact, the same design as Alan's proposal was discussed at the dev meeting (sorry it was not written in the meeting log). The one difference between it and Alan's is that an exception should be thrown for Integers other than 0..255 until we have a convincing use case. I personally think it is reasonably convincing that it is the same as setbyte, though.

Updated by matz (Yukihiro Matsumoto) 4 months ago

String#append_as_bytes looks good to me too. Accepted.

Matz.

Actions #26

Updated by byroot (Jean Boussier) 4 months ago

  • Status changed from Open to Closed

Applied in changeset git|16f241f0aa047ed77ccea6b6c361b421a72d0454.


Implement String#append_as_bytes(String | Integer, ...)

[Feature #20594]

A handy method to construct a string out of multiple chunks.

Contrary to String#concat, it doesn't do any encoding negociation,
and simply append the content as bytes regardless of whether this
result in a broken string or not.

It's the caller responsibility to check for String#valid_encoding?
in cases where it's needed.

When passed integers, only the lower byte is considered, like in
String#setbyte.

Updated by Dan0042 (Daniel DeLorme) 4 months ago

For integers, the i & 0xff masking behavior comes from String#setbyte.

I agree it makes sense to mirror setbyte rather than String#<<(Integer) here.

Can anyone explain what is the use case for this behavior? It seems strange to me that silently truncating integers would be a desirable behavior. Especially given the example above append_as_bytes(title.bytesize, title) would result in corrupted output if title is larger than 255 bytes.

Updated by tenderlovemaking (Aaron Patterson) 4 months ago

Dan0042 (Daniel DeLorme) wrote in #note-27:

For integers, the i & 0xff masking behavior comes from String#setbyte.

I agree it makes sense to mirror setbyte rather than String#<<(Integer) here.

Can anyone explain what is the use case for this behavior? It seems strange to me that silently truncating integers would be a desirable behavior. Especially given the example above append_as_bytes(title.bytesize, title) would result in corrupted output if title is larger than 255 bytes.

I don't think there is a specific use case (or maybe the use case is the same as when providing out-of-bounds numbers to setbyte?). To me the question is "what to do with out-of-bounds input?", and it seems setbyte will just truncate bad input so append_as_bytes is doing the same.

Updated by Dan0042 (Daniel DeLorme) 4 months ago

I understand the idea of keeping things consistent. But why does setbyte have this behavior in the first place? A few years back there was an attempt to check bounds, and then in #15460 Matz apparently decided he "prefers mod 256 behaviour for larger numbers". But there's no context or explanation for this preference, so I'm left mystified.

Updated by shyouhei (Shyouhei Urabe) 4 months ago

Dan0042 (Daniel DeLorme) wrote in #note-29:

I understand the idea of keeping things consistent. But why does setbyte have this behavior in the first place? A few years back there was an attempt to check bounds, and then in #15460 Matz apparently decided he "prefers mod 256 behaviour for larger numbers". But there's no context or explanation for this preference, so I'm left mystified.

I remember that in #15460 @gettalong showed us a use case in his HexaPDF when modulo behaviour is in fact useful. Matz was persuaded by that.

Updated by Dan0042 (Daniel DeLorme) 4 months ago

Ironically, HexaPDF now uses an explicit modulo 256, so it would actually be fine if #setbyte raised an out-of-bound error.
https://github.com/gettalong/hexapdf/blob/master/lib/hexapdf/filter/predictor.rb#L166

imho better safe than sorry. It's very easy to add a modulo when needed, so the built-in modulo 256 has limited usefulness. The opposite, adding a range check in ruby, is much more cumbersome. Just my 2¢

Updated by gettalong (Thomas Leitner) 4 months ago

Dan0042 (Daniel DeLorme) wrote in #note-31:

Ironically, HexaPDF now uses an explicit modulo 256, so it would actually be fine if #setbyte raised an out-of-bound error.
https://github.com/gettalong/hexapdf/blob/master/lib/hexapdf/filter/predictor.rb#L166

Yes, as mentioned in the linked issue #15460 I implemented this so that HexaPDF remained compatible with 2.6.0. Since I dropped support for Ruby 2.6 in HexaPDF at the start of 2024, I could now change it back again.

I don't have a strong opinion on whether setbyte or now append_as_bytes should do modulo 256 or bounds checking, I was more concerned with compatibility to previous versions back then.

Updated by Eregon (Benoit Daloze) 4 months ago

I agree with @Dan0042, the new API should not have this error-prone behavior, it should raise if out of bounds instead of silent truncation.

I also think we should fix setbyte to not have this behavior, but that's a different issue.
There it might be slightly less error prone as setbyte is clearly meant to write a single byte.
Still, silent data truncation is always a bad idea.

Actions #34

Updated by Eregon (Benoit Daloze) 4 months ago

  • Related to Bug #15460: Behaviour of String#setbyte changed added
Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0