Project

General

Profile

Actions

Feature #20594

open

A new String method to append bytes while preserving encoding

Added by byroot (Jean Boussier) 4 days ago. Updated 3 days ago.

Status:
Open
Assignee:
-
Target version:
-
[ruby-core:118388]

Description

Context

When working with binary protocols such as protobuf or MessagePack, you may often need to assemble multiple
strings of different encoding:

Post = Struct.new(:title, :body) do
  def serialize(buf)
    buf <<
      255 << title.bytesize << title <<
      255 << body.bytesize << body
  end
end

Post.new("Hello", "World").serialize("somedata".b) # => "somedata\xFF\x05Hello\xFF\x05World" #<Encoding:ASCII-8BIT>

The problem in the above case, is that because Encoding::ASCII_8BIT is declared as ASCII compatible,
if one of the appended string contains bytes outside the ASCII range, string is automatically promoted
to another encoding, which then leads to encoding issues:

Post.new("H€llo", "Wôrld").serialize("somedata".b) # => incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)

In many cases, you want to append to a String without changing the receiver's encoding.

The issue isn't exclusive to binary protocols and formats, it also happen with ASCII protocols that accept arbitrary bytes inline,
like Redis's RESP protocol or even HTTP/1.1.

Previous discussion

There was a similar feature request a while ago, but it was abandoned: https://bugs.ruby-lang.org/issues/14975

Existing solutions

You can of course always cast the strings you append to avoid this problem:

Post = Struct.new(:title, :body) do
  def serialize(buf)
    buf <<
      255 << title.bytesize << title.b <<
      255 << body.bytesize << body.b
  end
end

But this cause a lot of needless allocations.

You'd think you could also use bytesplice, but it actually has the same issue:

Post = Struct.new(:title, :body) do
  def serialize(buf)
    buf << 255 << title.bytesize
    buf.bytesplice(buf.bytesize, title.bytesize, title)
    buf << 255 << body.bytesize
    buf.bytesplice(buf.bytesize, body.bytesize, title)
  end
end
Post.new("H€llo", "Wôrld").serialize("somedata".b) # => 'String#bytesplice': incompatible character encodings: BINARY (ASCII-8BIT) and UTF-8 (Encoding::CompatibilityError)

And even if it worked, it would be very unergonomic.

Proposal: a byteconcat method

A solution to this would be to add a new byteconcat method, that could be shimed as:

class String
  def byteconcat(*strings)
    strings.map! do |s|
      if s.is_a?(String) && s.encoding != encoding
        s.dup.force_encoding(encoding)
      else
        s
      end
    end
    concat(*strings)
  end
end

Post = Struct.new(:title, :body) do
  def serialize(buf)
    buf.byteconcat(
      255, title.bytesize, title,
      255, body.bytesize, body,
    )
  end
end

Post.new("H€llo", "Wôrld").serialize("somedata".b) # => "somedata\xFF\aH\xE2\x82\xACllo\xFF\x06W\xC3\xB4rld" #<Encoding:ASCII-8BIT>

But of course a builtin implementation wouldn't need to dup the arguments.

Like other byte* methods, it's the responsibility of the caller to ensure the resulting string has a valid encoding, or
to deal with it if not.

Method name and signature

Name

This proposal suggests String#byteconcat, to mirror String#concat, but other names are possible:

  • byteappend (like Array#append)
  • bytepush (like Array#push)

Signature

This proposal makes byteconcat accept either String or Integer (in char range) arguments like concat. I believe it makes sense for consistency and also because it's not uncommon for protocols to have some byte based segments, and Integers are more convenient there.

The proposed method also accept variable arguments for consistency with String#concat, Array#push, Array#append.

The proposed method returns self, like concat and others.

YJIT consideration

I consulted @maximecb (Maxime Chevalier-Boisvert) about this proposal, and according to her, accepting variable arguments makes it harder for YJIT to optimize.
I suspect consistency with other APIs trumps the performance consideration, but I think it's worth mentioning.


Related issues 1 (0 open1 closed)

Related to Ruby master - Feature #14975: String#append without changing receiver's encodingRejectedioquatix (Samuel Williams)Actions
Actions #1

Updated by byroot (Jean Boussier) 4 days ago

  • Related to Feature #14975: String#append without changing receiver's encoding added

Updated by Eregon (Benoit Daloze) 4 days ago

+1 from me.

I forgot that String#concat accepted integers.
Together with accepting variable arguments, it feels like this new method would be quite a bit complicated to implement and optimize well.
But, it's consistent with String#concat so I think it makes sense as proposed.

Updated by maximecb (Maxime Chevalier-Boisvert) 3 days ago

I consulted @maximecb (Maxime Chevalier-Boisvert) (Maxime Chevalier-Boisvert) about this proposal, and according to her, accepting variable arguments makes it harder for YJIT to optimize.

Yeah my personal preference would be for something like bytepush or byteappend that accepts a single argument, the simplest possible method, that is kept as strict as possible, while still getting the job done.

The technical reason being that in order to handle multiple arguments, we have to essentially unroll the loop in the YJIT codegen, and speculate that the arguments will all keep the same type at run-time. This is likely to be true, but it means that we have to speculate on many things at once and then generate a big piece of machine code all at the same time.

I understand that there is some temptation to keep the API more similar to other String methods with a similar purpose, but IMO this method is already unusual because it's essentially an unsafe operation (no encoding validation). The reason we're discussing this method in the first place is performance, and so maybe performance/simplicity should be the main concern.

If the only way this change will be allowed to happen is to allow a variable argument count, then OK I guess, but I would like to push for something more simple, less dynamic. We sort of have the benefit of hindsight here, but take the Ruby binding API for example. If we were designing Ruby today, there is no way we would choose to make binding as powerful and unrestricted as it is. It's much easier to remove restrictions later than to add them back in.

Actions

Also available in: Atom PDF

Like0
Like0Like0Like0