https://bugs.ruby-lang.org/https://bugs.ruby-lang.org/favicon.ico?17113305112017-06-03T00:32:15ZRuby Issue Tracking SystemRuby master - Feature #13626: Add String#byteslice!https://bugs.ruby-lang.org/issues/13626?journal_id=652502017-06-03T00:32:15Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p><a href="mailto:samuel@oriontransfer.org" class="email">samuel@oriontransfer.org</a> wrote:</p>
<blockquote>
<p><a href="https://bugs.ruby-lang.org/issues/13626" class="external">https://bugs.ruby-lang.org/issues/13626</a></p>
</blockquote>
<p>I used to want this, too; but then I realized IO#read and<br>
similar methods will always return a binary string when given a<br>
length limit.</p>
<p>So String#slice! should be enough.</p>
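<p>A minimal sketch of that point, assuming StringIO as a stand-in for a real IO object (the variable names are illustrative): a length-limited read yields a binary string, so String#slice! counts in bytes.</p>

```ruby
require "stringio"

# StringIO stands in for a socket or file here; the names are illustrative.
io = StringIO.new("αβγ")          # 6 bytes of UTF-8 data
read_buffer = io.read(6)          # length-limited read

read_buffer.encoding              # => #<Encoding:ASCII-8BIT>

# On a binary string, character indices are byte indices,
# so slice! removes exactly the requested number of bytes.
chunk = read_buffer.slice!(0, 2)
chunk.bytes                       # => [206, 177] (the two bytes of "α")
read_buffer.bytesize              # => 4
```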
<p>(And IO#read and friends without a length limit is suicidal, anyways :)</p> Ruby master - Feature #13626: Add String#byteslice!https://bugs.ruby-lang.org/issues/13626?journal_id=652682017-06-03T22:49:14Zioquatix (Samuel Williams)samuel@oriontransfer.net
<ul></ul><p>Thanks for that idea.</p>
<p>If that's the case, when appending to the write buffer:</p>
<pre><code>write_buffer = String.new.b
unicode_string = "\u1234".force_encoding("UTF-8")
write_buffer << unicode_string
write_buffer.encoding # Changed from ASCII-8BIT to Encoding:UTF-8
</code></pre>
<p>The only way I can think to fix this is to run +force_encoding+ on the write buffer after every append but this seems hugely inefficient.</p>
<p>Ideas?</p> Ruby master - Feature #13626: Add String#byteslice!https://bugs.ruby-lang.org/issues/13626?journal_id=652982017-06-07T01:08:28Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p><a href="mailto:samuel@oriontransfer.org" class="email">samuel@oriontransfer.org</a> wrote:</p>
<blockquote>
<p>Thanks for that idea.</p>
<p>If that's the case, when appending to the write buffer:</p>
<pre><code>write_buffer = String.new.b
unicode_string = "\u1234".force_encoding("UTF-8")
write_buffer << unicode_string
write_buffer.encoding # Changed from ASCII-8BIT to Encoding:UTF-8
</code></pre>
<p>The only way I can think to fix this is to run +force_encoding+ on the write buffer after every append but this seems hugely inefficient.</p>
<p>Ideas?</p>
</blockquote>
<p>String#force_encoding is done in-place so it should not be<br>
that slow, the String#<< would be the slow part since it<br>
involves at least one memcpy (worst case is realloc + 2 memcpy)</p>
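<p>As a rough illustration of the in-place behavior (a sketch, not a benchmark): force_encoding returns the receiver itself with a new encoding tag, so re-tagging the buffer after an append does not copy its bytes.</p>

```ruby
write_buffer = String.new.b
write_buffer << "\u1234"                 # the append retags the empty buffer
write_buffer.encoding                    # => #<Encoding:UTF-8>

# force_encoding only changes the encoding tag on the same object;
# the bytes themselves are neither copied nor converted.
same = write_buffer.force_encoding(Encoding::ASCII_8BIT)
same.equal?(write_buffer)                # => true
write_buffer.encoding                    # => #<Encoding:ASCII-8BIT>
write_buffer.bytes                       # => [225, 136, 180]
```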
<p>But I'm not sure why you would want to be setting data to<br>
UTF-8; I guess you got it from some 3rd-party library?</p>
<p>Maybe String#b! could be a shorter alias for<br>
force_encoding(Encoding::ASCII_8BIT); but yeah, exposing writev via<br>
[Feature <a class="issue tracker-2 status-5 priority-4 priority-default closed" title="Feature: IO#writev (Closed)" href="https://bugs.ruby-lang.org/issues/9323">#9323</a>] is probably the best option, anyways.</p>
<p>Fwiw, I'm also not convinced String#<< behavior about changing<br>
write_buffer to Encoding::UTF-8 in your above example is good<br>
behavior on Ruby's part... But I don't know much about human<br>
language encodings, I am just a *nix plumber where a byte is a<br>
byte.</p> Ruby master - Feature #13626: Add String#byteslice!https://bugs.ruby-lang.org/issues/13626?journal_id=653622017-06-14T02:00:31Zioquatix (Samuel Williams)samuel@oriontransfer.net
<ul></ul><blockquote>
<p>Fwiw, I'm also not convinced String#<< behavior about changing<br>
write_buffer to Encoding::UTF-8 in your above example is good<br>
behavior on Ruby's part...</p>
</blockquote>
<p>Agreed.</p> Ruby master - Feature #13626: Add String#byteslice!https://bugs.ruby-lang.org/issues/13626?journal_id=668752017-09-25T08:08:51Zmatz (Yukihiro Matsumoto)matz@ruby.or.jp
<ul></ul><p>Sounds OK to me.</p>
<p>Matz.</p> Ruby master - Feature #13626: Add String#byteslice!https://bugs.ruby-lang.org/issues/13626?journal_id=668762017-09-25T08:10:35Zakr (Akira Tanaka)akr@fsij.org
<ul></ul><p>At the developer meeting, we discussed that the byteslice! and byteslice methods should take the same arguments.</p> Ruby master - Feature #13626: Add String#byteslice!https://bugs.ruby-lang.org/issues/13626?journal_id=668842017-09-25T08:27:55Zduerst (Martin Dürst)duerst@it.aoyama.ac.jp
<ul></ul><p>normalperson (Eric Wong) wrote:</p>
<blockquote>
<p>Fwiw, I'm also not convinced String#<< behavior about changing<br>
write_buffer to Encoding::UTF-8 in your above example is good<br>
behavior on Ruby's part... But I don't know much about human<br>
language encodings, I am just a *nix plumber where a byte is a<br>
byte.</p>
</blockquote>
<p>This behavior may not be the best for this specific case, but in general, if one string is US-ASCII, and the other is UTF-8, then UTF-8 is a superset of US-ASCII, and concatenating the two will produce a string in UTF-8. Dropping the encoding would lose important information.</p>
<p>Please also note that you are actually on dangerous ground here. The above only works because the string doesn't contain any non-ASCII (high bit set) bytes. As soon as there is such a byte, there will be an error.</p>
<pre><code>s = "abcde".b
s.encoding # => #<Encoding:ASCII-8BIT>
s << "αβγδε" # => "abcdeαβγδε"
s.encoding # => #<Encoding:UTF-8>
</code></pre>
<p>but:</p>
<pre><code>t = "αβγδε".b # => "\xCE\xB1\xCE\xB2\xCE\xB3\xCE\xB4\xCE\xB5"
t.encoding # => #<Encoding:ASCII-8BIT>
t << "λμπρ" # => Encoding::CompatibilityError: incompatible character encodings: ASCII-8BIT and UTF-8
</code></pre>
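<p>One way to avoid the error above, sketched here with illustrative variable names, is to append a binary copy of the incoming string:</p>

```ruby
buffer = "αβγδε".b                # binary buffer containing high-bit bytes
incoming = "λμπρ"                 # UTF-8 string to append

# String#b returns a copy tagged ASCII-8BIT, so both sides are binary
# and the append neither raises nor changes the buffer's encoding.
buffer << incoming.b

buffer.encoding                   # => #<Encoding:ASCII-8BIT>
buffer.bytesize                   # => 18 (10 + 8 bytes)
```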
<p>So if you have an ASCII-8BIT buffer, and want to append something, always make sure you make the appended stuff also ASCII-8BIT.</p> Ruby master - Feature #13626: Add String#byteslice!https://bugs.ruby-lang.org/issues/13626?journal_id=714652018-04-12T04:04:10Zioquatix (Samuel Williams)samuel@oriontransfer.net
<ul></ul><p>If you round trip UTF-8 to ASCII-8BIT and back again, the result should be the same IMHO. It's just the interpretation of the bytes which is different, but the underlying data should be the same. I still think adding <code>String#byteslice!</code> is a good idea. Has there been any progress?</p> Ruby master - Feature #13626: Add String#byteslice!https://bugs.ruby-lang.org/issues/13626?journal_id=714662018-04-12T04:06:50Zioquatix (Samuel Williams)samuel@oriontransfer.net
<ul></ul><p>By the way, I ended up implementing <a href="https://github.com/socketry/async-io/blob/master/lib/async/io/binary_string.rb" class="external">https://github.com/socketry/async-io/blob/master/lib/async/io/binary_string.rb</a> which I guess is okay but it's not ideal.</p> Ruby master - Feature #13626: Add String#byteslice!https://bugs.ruby-lang.org/issues/13626?journal_id=727952018-07-03T23:07:05Zjanko (Janko Marohnić)janko@hey.com
<ul></ul><p>I support adding <code>String#byteslice!</code>. I've been using <code>String#byteslice</code> in custom IO-like objects that implement <code>IO#read</code> semantics, as the strings I work with don't necessarily have to be in binary encoding (otherwise I'd just use <code>String#slice</code>), they can also be in UTF-8. Since <code>IO#read</code> needs to work in terms of bytes, that's why I needed <code>String#byteslice</code>.</p>
<p>I've used the exact idiom from Samuel's original description in three different projects already:</p>
<ul>
<li><a href="https://github.com/janko-m/down/blob/ac4a32f296cb9cd8c12fc46a01a7e2f7c5fcd1b2/lib/down/chunked_io.rb#L169-L170" class="external">https://github.com/janko-m/down/blob/ac4a32f296cb9cd8c12fc46a01a7e2f7c5fcd1b2/lib/down/chunked_io.rb#L169-L170</a></li>
<li><a href="https://github.com/janko-m/goliath-rack_proxy/blob/7b359ff3ddfa3cba23c32220389abb39481735a9/lib/goliath/rack_proxy.rb#L134-L135" class="external">https://github.com/janko-m/goliath-rack_proxy/blob/7b359ff3ddfa3cba23c32220389abb39481735a9/lib/goliath/rack_proxy.rb#L134-L135</a></li>
<li><a href="https://github.com/socketry/falcon/blob/12b8818812b23c920e545e6b4c91e08e5348ee04/lib/falcon/adapters/input.rb#L80-L81" class="external">https://github.com/socketry/falcon/blob/12b8818812b23c920e545e6b4c91e08e5348ee04/lib/falcon/adapters/input.rb#L80-L81</a></li>
</ul>
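<p>For illustration, a two-step byteslice read on a UTF-8 buffer might look like the following. This is a sketch with a hypothetical helper name (<code>read_bytes</code>), not code copied from the linked projects.</p>

```ruby
# Hypothetical helper: take the first `size` bytes off the front of `buffer`.
# Returns [chunk, remaining]; the input string itself is left untouched.
def read_bytes(buffer, size)
  chunk = buffer.byteslice(0, size)
  # byteslice returns nil when the start offset is past the end of the string.
  rest = buffer.byteslice(size, buffer.bytesize) || ""
  [chunk, rest]
end

buffer = "h\u00E9llo"             # UTF-8 "héllo", 6 bytes ("é" is 2 bytes)
chunk, buffer = read_bytes(buffer, 3)

chunk                             # => "hé"
chunk.bytesize                    # => 3
buffer                            # => "llo"
```

<p>Each call allocates two new strings, one for the chunk and one for the remainder.</p>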
<p><code>String#byteslice!</code> would allow reducing the code and probably end up with fewer strings at the end.</p> Ruby master - Feature #13626: Add String#byteslice!https://bugs.ruby-lang.org/issues/13626?journal_id=988642022-08-23T13:31:22Zbyroot (Jean Boussier)byroot@ruby-lang.org
<ul><li><strong>Related to</strong> <i><a class="issue tracker-1 status-6 priority-4 priority-default closed" href="/issues/18972">Bug #18972</a>: String#byteslice should return BINARY (aka ASCII-8BIT) Strings</i> added</li></ul> Ruby master - Feature #13626: Add String#byteslice!https://bugs.ruby-lang.org/issues/13626?journal_id=988672022-08-23T13:48:39ZEregon (Benoit Daloze)
<ul></ul><p>Why not simply <code>String#slice!</code> if the string encoding is BINARY?</p>
<pre><code class="ruby syntaxhl" data-language="ruby"><span class="n">result</span> <span class="o">=</span> <span class="vi">@read_buffer</span><span class="p">.</span><span class="nf">slice!</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">size</span><span class="p">)</span> <span class="c1"># @read_buffer must be in the BINARY encoding</span>
</code></pre>
<p>For IO buffers, I think it's reasonable to ensure every string appended is BINARY, so the <code><<</code> gotcha is just a small inconvenience.<br>
And if it's not BINARY (or fixed-width encoding), how do you ensure you are not cutting e.g. in the middle of a UTF-8 character?</p> Ruby master - Feature #13626: Add String#byteslice!https://bugs.ruby-lang.org/issues/13626?journal_id=988682022-08-23T13:56:05ZEregon (Benoit Daloze)
<ul></ul><p>I think there is a misunderstanding of what <code>byte*</code> methods are for.</p>
<p><code>byte*</code> methods are for dealing with byte indices and avoid the conversion between byte and character indices (which can be expensive for UTF-8).<br>
<code>byte*</code> methods are not "methods for BINARY strings".<br>
For BINARY strings it's fine/better to use the regular String methods since byte index=character index for BINARY and other fixed-width encodings.</p> Ruby master - Feature #13626: Add String#byteslice!https://bugs.ruby-lang.org/issues/13626?journal_id=988702022-08-23T15:32:33Zbyroot (Jean Boussier)byroot@ruby-lang.org
<ul></ul><p>The PR is here in case someone feels like reviewing: <a href="https://github.com/ruby/ruby/pull/6275" class="external">https://github.com/ruby/ruby/pull/6275</a></p>
<p>As for the recently raised concerns, I don't really have any strong opinion. I implemented this on <a class="user active user-mention" href="https://bugs.ruby-lang.org/users/3344">@ioquatix (Samuel Williams)</a> 's demand, I personally believe that given Ruby's String implementation, calling <code>slice!</code> (or byteslice!) on a buffer is terrible for performance (cf <a href="https://github.com/ruby/net-protocol/pull/14" class="external">https://github.com/ruby/net-protocol/pull/14</a>).</p>
<p>That said, it's always very awkward to see code that mixes <code>bytesize</code>, <code>byteslice</code> and <code>slice!</code>. Every time I see some, I think I found a bug until I audit it and figure out that the string is indeed <code>Encoding::BINARY</code>. So for that alone I'm in favor of this method.</p> Ruby master - Feature #13626: Add String#byteslice!https://bugs.ruby-lang.org/issues/13626?journal_id=989622022-08-27T06:30:49Zioquatix (Samuel Williams)samuel@oriontransfer.net
<ul></ul><p>Just to clarify I hope I have not demanded anything. "But if you want to have a go at it, that would be awesome" was all I said.</p>
<p>I have been trying <code>buffer.force_encoding(Encoding::BINARY)</code> followed by <code>slice!</code>, but you are right, it does look awkward. Given how easily a string can change to a non-binary encoding, I also get a similar feeling about whether it's a bug or not (or could be in some unexpected scenario).</p> Ruby master - Feature #13626: Add String#byteslice!https://bugs.ruby-lang.org/issues/13626?journal_id=989632022-08-27T07:05:15Zbyroot (Jean Boussier)byroot@ruby-lang.org
<ul></ul><blockquote>
<p>I hope I have not demanded anything</p>
</blockquote>
<p>Yes, sorry, not what I meant, it's one of these words that has similar meaning in French, yet a radically different connotation.</p> Ruby master - Feature #13626: Add String#byteslice!https://bugs.ruby-lang.org/issues/13626?journal_id=989662022-08-27T11:16:08ZEregon (Benoit Daloze)
<ul></ul><p>I think the underlying issue is we want a string append method which does not change the receiver's encoding (and instead raises an EncodingError if it would need to change it).</p>
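<p>A rough sketch of what such an append could look like in plain Ruby. The helper name <code>append_strict</code> is hypothetical and not part of Ruby; it assumes <code>Encoding.compatible?</code> reports the encoding a concatenation would produce.</p>

```ruby
# Hypothetical helper, not part of Ruby: append `data` to `buffer`, but
# raise instead of letting << change the buffer's encoding.
def append_strict(buffer, data)
  target = buffer.encoding
  # Encoding.compatible? returns the encoding the concatenation would have,
  # or nil when the two strings cannot be concatenated at all.
  result = Encoding.compatible?(buffer, data)
  unless result == target
    raise EncodingError, "append would change #{target} to #{result.inspect}"
  end
  buffer << data
end

buffer = String.new.b
append_strict(buffer, "abc".b)       # binary onto binary: accepted

begin
  append_strict(buffer, "\u1234")    # would retag the buffer as UTF-8
rescue EncodingError => e
  rejected = e.message
end

buffer.encoding                      # => #<Encoding:ASCII-8BIT>
buffer.bytesize                      # => 3
```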