Ruby Issue Tracking System (https://bugs.ruby-lang.org/)
Ruby master - Feature #18598: Add String#bytesplice https://bugs.ruby-lang.org/issues/18598?journal_id=96627 2022-02-22T09:21:42Z shugo (Shugo Maeda)
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/96627/diff?detail_id=62076">diff</a>)</li></ul> Ruby master - Feature #18598: Add String#bytesplicehttps://bugs.ruby-lang.org/issues/18598?journal_id=966292022-02-22T10:31:25Zshugo (Shugo Maeda)
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/96629/diff?detail_id=62077">diff</a>)</li></ul> Ruby master - Feature #18598: Add String#bytesplicehttps://bugs.ruby-lang.org/issues/18598?journal_id=966302022-02-22T10:48:42ZEregon (Benoit Daloze)
<ul></ul><p>Shouldn't a text editor use the ropes representation for Strings instead? ( <a href="https://en.wikipedia.org/wiki/Rope_(data_structure)" class="external">https://en.wikipedia.org/wiki/Rope_(data_structure)</a> )<br>
This sounds very inefficient because bytesplice will need to copy everything after the insert if the <code>inserted_bytes.length != length</code>.</p>
<p>That's more of a personal opinion but I always found <code>splice</code> arguments and semantics confusing, also in JavaScript.<br>
<code>[]=</code> at least makes it much clearer, but <code>s.bytesplice(2, 3, "x")</code> sounds like a C API to me.<br>
If we do add this I would suggest only adding the Range version for simplicity.</p>
<p>I think for byteindex and byteoffset in <a class="issue tracker-2 status-5 priority-4 priority-default closed" title="Feature: Byte-based operations for String (Closed)" href="https://bugs.ruby-lang.org/issues/13110">#13110</a> there was good motivation: Ruby internally would need to use byte offsets anyway, so exposing those to the user seemed relatively harmless, and, as you showed, doing without them needed very complex hacks.<br>
But here I question the need for it, because the code before bytesplice, i.e., the code before <a href="https://github.com/shugo/textbringer/pull/31/files" class="external">https://github.com/shugo/textbringer/pull/31/files</a>, seems reasonable enough.<br>
It's also a very specific use case; I would like to see other use cases if we add a core method to String.</p>
<p>There are also other ways to solve this, where I think you semantically want a byte array/buffer which can be shown as text and searched:</p>
<ul>
<li>Use UTF-32LE/UTF-32BE to have constant indexing of Strings, then <code>[]=</code> works fine</li>
<li>Can the String be kept as Encoding::BINARY all the time, why does it need to be UTF-8? Can it just be reencoded to UTF-8 in the few places which really need it?</li>
<li>Do not use String and e.g. use an Array of byte values or a C extension</li>
<li>Use Ropes or similar implemented in Ruby, which would avoid extra copying and might not need to use byte offsets at all</li>
<li>Add some way to have a "cursor object" in a String, which knows both the byte index and the character index, and have its own methods, that would be much more general and could help improve the performance in far more cases (e.g., could also yield such a cursor in some <code>each_char_with_cursor</code> method). It's probably too tricky to implement correctly when the String is mutable though.</li>
</ul> Ruby master - Feature #18598: Add String#bytesplicehttps://bugs.ruby-lang.org/issues/18598?journal_id=966512022-02-22T23:40:13Zshugo (Shugo Maeda)
<ul></ul><p>Eregon (Benoit Daloze) wrote in <a href="#note-3">#note-3</a>:</p>
<blockquote>
<p>Shouldn't a text editor use the ropes representation for Strings instead? ( <a href="https://en.wikipedia.org/wiki/Rope_(data_structure)" class="external">https://en.wikipedia.org/wiki/Rope_(data_structure)</a> )<br>
This sounds very inefficient because bytesplice will need to copy everything after the insert if the <code>inserted_bytes.length != length</code>.</p>
</blockquote>
<p>In general ropes may be a good choice, but I prefer gap buffers (<a href="https://en.wikipedia.org/wiki/Gap_buffer" class="external">https://en.wikipedia.org/wiki/Gap_buffer</a>) implemented on top of String,<br>
because they are simple and fast enough for regular files.<br>
It's hard to implement a time-efficient data structure in pure Ruby, but the built-in class String is fast and has rich features like regular expressions.</p>
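<p>For readers unfamiliar with the technique, here is a rough sketch of a gap buffer kept in a single binary String. The class and method names (<code>GapBuffer</code>, <code>move_gap</code>, <code>insert</code>) and the gap size are made up for illustration and are not taken from Textbringer. All positions are byte offsets into the text, and the caller is assumed to pass offsets that fall on character boundaries.</p>

```ruby
# Hypothetical minimal gap buffer; an illustration only, not Textbringer's code.
class GapBuffer
  GAP_SIZE = 16 # extra free bytes added whenever the gap has to grow (arbitrary)

  def initialize(text = "")
    @buf = text.b                 # binary copy, so indexes are byte offsets
    @gap_start = @buf.bytesize    # the gap starts out empty, at the end
    @gap_end = @buf.bytesize
  end

  # Move the gap so that it begins at byte offset pos of the text.
  def move_gap(pos)
    if pos < @gap_start
      len = @gap_start - pos
      chunk = @buf.byteslice(pos, len)
      @buf[@gap_end - len, len] = chunk   # shift text to the far side of the gap
      @gap_start -= len
      @gap_end -= len
    elsif pos > @gap_start
      len = pos - @gap_start
      chunk = @buf.byteslice(@gap_end, len)
      @buf[@gap_start, len] = chunk       # shift text to the near side of the gap
      @gap_start += len
      @gap_end += len
    end
  end

  # Insert str at byte offset pos; only the gap is written, not the tail.
  def insert(pos, str)
    bytes = str.b
    move_gap(pos)
    gap = @gap_end - @gap_start
    if bytes.bytesize > gap
      grow = bytes.bytesize - gap + GAP_SIZE
      @buf[@gap_end, 0] = "\0".b * grow   # enlarge the gap
      @gap_end += grow
    end
    @buf[@gap_start, bytes.bytesize] = bytes
    @gap_start += bytes.bytesize
  end

  # Text without the gap, tagged as UTF-8 again.
  def to_s
    head = @buf.byteslice(0, @gap_start)
    tail = @buf.byteslice(@gap_end, @buf.bytesize - @gap_end)
    (head + tail).force_encoding(Encoding::UTF_8)
  end
end

buffer = GapBuffer.new("héllo")
buffer.insert(0, "日本")   # 6 bytes at the very beginning
buffer.insert(7, "X")      # byte offset 7 is right after "h"
puts buffer.to_s
```

<p>Repeated insertions near the same position only touch the gap, which is why this layout tends to suit interactive editing.</p>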
<blockquote>
<p>That's more of a personal opinion but I always found <code>splice</code> arguments and semantics confusing, also in JavaScript.<br>
<code>[]=</code> at least makes it much clearer, but <code>s.bytesplice(2, 3, "x")</code> sounds like a C API to me.<br>
If we do add this I would suggest only adding the Range version for simplicity.</p>
</blockquote>
<p>I prefer <code>[]=</code> too, but bytesplice is a low level API, so I think it's not a big issue.<br>
Only adding the Range version is acceptable for me, but we already have String#byteslice and it supports the Integer version, so it may be better to support the Integer version for consistency.<br>
However, omitting the second argument length is harmful for String#bytesplice (if the default length is 1 byte, multibyte strings may get broken), so length shouldn't be omitted.</p>
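<p>To make the two call forms concrete, here is a small sketch. It assumes a Ruby version that ships String#bytesplice (3.2 or later, as far as I know) and returns nil on older versions. The helper name <code>bytesplice_demo</code> is hypothetical. The return value of bytesplice is deliberately not used, since only the mutation of the receiver matters here.</p>

```ruby
# Illustrative helper (hypothetical name); shows both argument forms.
def bytesplice_demo
  return nil unless String.method_defined?(:bytesplice)

  # Integer form: replace 3 bytes starting at byte offset 3 (the bytes of "本").
  by_length = "日本語".dup
  by_length.bytesplice(3, 3, "x")

  # Range form: replace bytes 0...3 (the bytes of "日").
  by_range = "日本語".dup
  by_range.bytesplice(0...3, "ab")

  # A span that cuts through a multibyte character is expected to be
  # rejected (with IndexError, to my understanding) instead of corrupting
  # the string.
  error = begin
    "日本語".dup.bytesplice(1, 1, "x")
    nil
  rescue StandardError => e
    e.class
  end

  [by_length, by_range, error]
end

p bytesplice_demo
```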
<blockquote>
<p>I think for byteindex and byteoffset in <a class="issue tracker-2 status-5 priority-4 priority-default closed" title="Feature: Byte-based operations for String (Closed)" href="https://bugs.ruby-lang.org/issues/13110">#13110</a> there was good motivation: Ruby internally would need to use byte offsets anyway, so exposing those to the user seemed relatively harmless, and, as you showed, doing without them needed very complex hacks.<br>
But here I question the need for it, because the code before bytesplice, i.e., the code before <a href="https://github.com/shugo/textbringer/pull/31/files" class="external">https://github.com/shugo/textbringer/pull/31/files</a>, seems reasonable enough.<br>
It's also a very specific use case; I would like to see other use cases if we add a core method to String.</p>
</blockquote>
<p>I'd like to hear others' opinions about use cases.<br>
I agree that Textbringer is a specific use case, but it's important for me, and it's good to introduce String#bytesplice if it's useful for others.</p>
<blockquote>
<p>There are also other ways to solve this, where I think you semantically want a byte array/buffer which can be shown as text and searched:</p>
<ul>
<li>Use UTF-32LE/UTF-32BE to have constant indexing of Strings, then <code>[]=</code> works fine</li>
</ul>
</blockquote>
<p>UTF-32 is not ASCII compatible and cannot be used as a script encoding, so I prefer UTF-8.</p>
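<p>A small sketch of the trade-off under discussion: with UTF-32LE every character is 4 bytes, so character indexes map directly to byte offsets, but the string can no longer be mixed with ordinary ASCII-compatible strings. The helper name <code>replace_char_utf32</code> is hypothetical.</p>

```ruby
# Illustrative helper (hypothetical name): replace one character by
# character index, going through UTF-32LE and back to UTF-8.
def replace_char_utf32(utf8_text, char_index, replacement)
  wide = utf8_text.encode(Encoding::UTF_32LE)
  # Character index N sits at byte offset 4 * N, so []= needs no scanning.
  wide[char_index, 1] = replacement.encode(Encoding::UTF_32LE)
  wide.encode(Encoding::UTF_8)
end

puts replace_char_utf32("日本語", 1, "x")

# UTF-32LE is not ASCII compatible, so combining it with a UTF-8 string fails.
p Encoding::UTF_32LE.ascii_compatible?
begin
  "日本語".encode(Encoding::UTF_32LE) + "x"
rescue Encoding::CompatibilityError => e
  puts e.class
end
```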
<blockquote>
<ul>
<li>Can the String be kept as Encoding::BINARY all the time, why does it need to be UTF-8? Can it just be reencoded to UTF-8 in the few places which really need it?</li>
</ul>
</blockquote>
<p>It's possible and that's what I do in the implementation of Textbringer.<br>
But such encoding changes are unnecessary if Ruby supports byte-based operations on strings with text encodings.<br>
I've heard from Naruse-san that he has a similar idea. He may have more use cases.</p>
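<p>For illustration, the kind of encoding round trip meant here could look like the sketch below. The helper name <code>splice_via_binary</code> is hypothetical and this is not Textbringer's actual code. It assumes the caller passes byte offsets that fall on character boundaries, because nothing in this approach checks that.</p>

```ruby
# Illustrative helper (hypothetical name): byte-based replacement on a
# text string by temporarily relabelling it as binary.
def splice_via_binary(str, byte_index, byte_length, replacement)
  original_encoding = str.encoding
  str.force_encoding(Encoding::BINARY)       # indexes are now byte offsets
  str[byte_index, byte_length] = replacement.b
  str
ensure
  str.force_encoding(original_encoding)      # relabel even if []= raised
end

text = "日本語".dup
splice_via_binary(text, 3, 3, "x")   # bytes 3..5 are the character "本"
puts text
```

<p>A byte-based method on String removes the two <code>force_encoding</code> calls and can additionally reject offsets that split a character.</p>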
<blockquote>
<ul>
<li>Do not use String and e.g. use an Array of byte values or a C extension</li>
</ul>
</blockquote>
<p>I wouldn't like to implement regular expressions on Array.</p>
<blockquote>
<ul>
<li>Use Ropes or similar implemented in Ruby, which would avoid extra copying and might not need to use byte offsets at all</li>
</ul>
</blockquote>
<p>I prefer String for the reasons stated above.</p>
<blockquote>
<ul>
<li>Add some way to have a "cursor object" in a String, which knows both the byte index and the character index, and have its own methods, that would be much more general and could help improve the performance in far more cases (e.g., could also yield such a cursor in some <code>each_char_with_cursor</code> method). It's probably too tricky to implement correctly when the String is mutable though.</li>
</ul>
</blockquote>
<p>Such a cursor object can be an alternative to byte-based methods, but a cursor object is only valid for a particular String at a particular time, and it may bring different issues from byte-based methods.</p> Ruby master - Feature #18598: Add String#bytesplicehttps://bugs.ruby-lang.org/issues/18598?journal_id=969122022-03-18T01:43:29Zko1 (Koichi Sasada)
<ul></ul><p>Matz said "go ahead".</p> Ruby master - Feature #18598: Add String#bytesplicehttps://bugs.ruby-lang.org/issues/18598?journal_id=969142022-03-18T02:55:09Zshugo (Shugo Maeda)
<ul><li><strong>Status</strong> changed from <i>Open</i> to <i>Closed</i></li></ul><p>Applied in changeset <a class="changeset" title="Add a NEWS entry about [Feature #18598] [ci skip]" href="https://bugs.ruby-lang.org/projects/ruby-master/repository/git/revisions/2fdfd499db489db9eb4046849aa785c3bd382761">git|2fdfd499db489db9eb4046849aa785c3bd382761</a>.</p>
<hr>
<p>Add a NEWS entry about [Feature <a class="issue tracker-2 status-5 priority-4 priority-default closed" title="Feature: Add String#bytesplice (Closed)" href="https://bugs.ruby-lang.org/issues/18598">#18598</a>] [ci skip]</p> Ruby master - Feature #18598: Add String#bytesplicehttps://bugs.ruby-lang.org/issues/18598?journal_id=969152022-03-18T03:01:00Zshugo (Shugo Maeda)
<ul></ul><p>String#bytesplice has been added by <a class="changeset" title="Add String#bytesplice" href="https://bugs.ruby-lang.org/projects/ruby-master/repository/git/revisions/1107839a7fed31339fc947995b7b45b8eaf4041b">git|1107839a7fed31339fc947995b7b45b8eaf4041b</a>.</p> Ruby master - Feature #18598: Add String#bytesplicehttps://bugs.ruby-lang.org/issues/18598?journal_id=1010822023-01-06T13:14:10ZEregon (Benoit Daloze)
<ul><li><strong>Related to</strong> <i><a class="issue tracker-2 status-1 priority-4 priority-default" href="/issues/19315">Feature #19315</a>: Lazy substrings in CRuby</i> added</li></ul> Ruby master - Feature #18598: Add String#bytesplicehttps://bugs.ruby-lang.org/issues/18598?journal_id=1010852023-01-06T13:17:46ZEregon (Benoit Daloze)
<ul></ul><p>shugo (Shugo Maeda) wrote in <a href="#note-4">#note-4</a>:</p>
<blockquote>
<blockquote>
<ul>
<li>Do not use String and e.g. use an Array of byte values or a C extension</li>
</ul>
</blockquote>
<p>I wouldn't like to implement regular expressions on Array.</p>
<blockquote>
<ul>
<li>Use Ropes or similar implemented in Ruby, which would avoid extra copying and might not need to use byte offsets at all</li>
</ul>
</blockquote>
<p>I prefer String for the reasons stated above.</p>
</blockquote>
<p>The typical approach is to flatten (or convert) the Rope to String before matching (whether the Rope is in Ruby or from the VM).<br>
I think that is good enough for a text editor.</p>
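<p>As a sketch of the flatten-then-match approach, assuming a toy rope made of leaf and concatenation nodes: the names <code>RopeLeaf</code>, <code>RopeConcat</code> and <code>rope_to_s</code> are invented for this example, and a real rope would also balance the tree and cache lengths.</p>

```ruby
# Toy rope (hypothetical classes) used only to show flattening before matching.
class RopeLeaf
  def initialize(text)
    @text = text
  end

  # Append this leaf's text to out and return out.
  def flatten_into(out)
    out << @text
  end
end

class RopeConcat
  def initialize(left, right)
    @left = left
    @right = right
  end

  # Append the left subtree, then the right subtree, and return out.
  def flatten_into(out)
    @left.flatten_into(out)
    @right.flatten_into(out)
  end
end

# Convert the whole rope to one String so that Regexp can be used on it.
def rope_to_s(rope)
  rope.flatten_into(String.new(encoding: Encoding::UTF_8))
end

rope = RopeConcat.new(
  RopeLeaf.new("hello "),
  RopeConcat.new(RopeLeaf.new("wör"), RopeLeaf.new("ld"))
)
flat = rope_to_s(rope)
p flat =~ /w.rld/   # the match spans two leaves, which is why we flatten first
```

<p>The cost is one full copy of the text per flatten, so an editor would presumably cache the flattened String until the next edit.</p>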