Project

General

Profile

Actions

Feature #19314

closed

String#bytesplice should support partial copy

Feature #19314: String#bytesplice should support partial copy

Added by shugo (Shugo Maeda) almost 3 years ago. Updated almost 3 years ago.

Status:
Closed
Assignee:
-
Target version:
-
[ruby-core:111674]

Description

String#bytesplice should support partial copy without temporary String objects.

For example, given x = "0123456789", either of the following replaces the contents of x with "0167856789":

x.bytesplice(2, 3, x, 6, 3)
x.bytesplice(2..4, x, 6..8)

Considerations

  • What should be the return value?
    • The return value should be the whole source string for performance and consistency with bytesplice(offset, len, s).
  • Can the source and destination ranges overlap?
    • Yes.
  • Can the source and destination lengths be different?
    • Yes.
  • Can range form and offset/length form be mixed in the source and destination?
    • No.
  • What should happen when any offset doesn't land on character boundary in text strings.
    • IndexError should be raised.
  • Can the length be omitted in the destination?
    • Maybe yes, but it may be confusing.

Use cases


Related issues 1 (1 open0 closed)

Related to Ruby - Feature #19315: Lazy substrings in CRubyOpenActions

Updated by Eregon (Benoit Daloze) almost 3 years ago Actions #1 [ruby-core:111677]

I think this is too hard to read and parse for a human and 5 arguments seems way too much for a core method.
It feels like a full memcpy/arraycopy which I don't think in general is a good idea for String.
The implementation complexity in []= and similar already hurts Ruby too much.

This is probably the 3rd or more workaround I see to have proper lazy substrings in CRuby, i.e., "abcdef"[1..3] must not copy bytes.
That is what needs to be solved (it already works in TruffleRuby).
Yes, it means RSTRING_PTR() might need to allocate to \0-terminate, so be it, it's worth it.

So I am strongly against this, it's a nth workaround for something simpler to solve which is much more helpful in general.

Updated by Eregon (Benoit Daloze) almost 3 years ago Actions #2

Updated by naruse (Yui NARUSE) almost 3 years ago Actions #3 [ruby-core:111680]

I agree that this is a workaround and a VM should solve this as an optimization.

But your proposal: Lazy substrings is not a solution because it also creates an object especially for small strings which is embedded in RVALUE.

I agree that this is memcpy/arraycopy.
Therefore this proposal should add a description how large this workaround contributes performance in such use cases as memcpy on Ruby.

Updated by Eregon (Benoit Daloze) almost 3 years ago Actions #4 [ruby-core:111683]

naruse (Yui NARUSE) wrote in #note-3:

But your proposal: Lazy substrings is not a solution because it also creates an object especially for small strings which is embedded in RVALUE.

Yes it creates a String instance reusing the same buffer.
That shouldn't cost much compared to copying many bytes.
It should be insignificant on a benchmark with a long string to copy/move, for a short string perf shouldn't matter much anyway (it won't the be bottleneck of the program).

If it's still too much overhead, it sounds like allocations in CRuby need to be better optimized, or escape analysis should be implemented.
Again, those 2 are more general and benefits are much wider than this one method change that would be used for very few Ruby programs and only handles one specific case.

Updated by Eregon (Benoit Daloze) almost 3 years ago Actions #5 [ruby-core:111684]

Ah, something I missed though is that with lazy substrings, there would still need to be a copy of the bytes to "unshare" the string when writing to it.
That copy would also be needed if the string was shared before (e.g. with .dup), but that's unknown in our case.
This does depend on how sharing is implemented, maybe CRuby can see it's only String instances sharing that buffer, and actually both strings are involved in this operation and so there is only need to copy the bytes of the substring.

It feels like a full memcpy/arraycopy which I don't think in general is a good idea for String.

To expand on that, I dislike that because it's using String as a byte array.
If anything, such operation should be supported on Array before String.

Now that we have IO::Buffer and there is https://docs.ruby-lang.org/en/master/IO/Buffer.html#method-i-copy, why not use that?

Updated by Eregon (Benoit Daloze) almost 3 years ago Actions #6 [ruby-core:111685]

Eregon (Benoit Daloze) wrote in #note-5:

Now that we have IO::Buffer and there is https://docs.ruby-lang.org/en/master/IO/Buffer.html#method-i-copy, why not use that?

So this does what you want I believe:

x = "0123456789"
IO::Buffer.for(x) do |buffer|
  buffer.copy(buffer, 2, 3, 6)
end
p x # => "0167856789"

I think there is no need to change String#bytesplice therefore (there is even not a need for String#bytesplice due to that, which I think we shouldn't have added).
And IO::Buffer seems better suited for byte-buffer-like operations.

Updated by naruse (Yui NARUSE) almost 3 years ago Actions #7 [ruby-core:111720]

That shouldn't cost much compared to copying many bytes.

This proposal shows two use cases: text editor and NAT, which doesn't copy many bytes.

Updated by shugo (Shugo Maeda) almost 3 years ago Actions #8

  • Status changed from Open to Closed

Applied in changeset git|f7b72462aa27716370c6bea1f2c240983aca9a55.


String#bytesplice should return self

In Feature #19314, we concluded that the return value of String#bytesplice
should be changed from the source string to the receiver, because the source
string is useless and confusing when extra arguments are added.

This change should be included in Ruby 3.2.1.

Updated by shugo (Shugo Maeda) almost 3 years ago Actions #9

  • Status changed from Closed to Open

Updated by naruse (Yui NARUSE) almost 3 years ago Actions #11

  • Status changed from Open to Closed

Applied in changeset git|373e62248c9dceb660e95f1cf05fa2a4a469cd64.


merge revision(s) f7b72462aa27716370c6bea1f2c240983aca9a55: [Backport #19356]

    String#bytesplice should return self

    In Feature #19314, we concluded that the return value of String#bytesplice
    should be changed from the source string to the receiver, because the source
    string is useless and confusing when extra arguments are added.

    This change should be included in Ruby 3.2.1.
    ---
     string.c                 | 4 ++--
     test/ruby/test_string.rb | 2 +-
     2 files changed, 3 insertions(+), 3 deletions(-)
Actions

Also available in: PDF Atom