Feature #18254
closedAdd an `offset` parameter to String#unpack and String#unpack1
Description
When working with binary protocols it's common to have to first unpack some kind of header or type prefix, and then based on that unpack another part of the string.
For instance here's a code snippet from Dalli, the most common Memcached client:
while buf.bytesize - pos >= 24
header = buf.slice(pos, 24)
(key_length, _, body_length, cas) = header.unpack(KV_HEADER)
if key_length == 0
# all done!
@multi_buffer = nil
@position = nil
@inprogress = false
break
elsif buf.bytesize - pos >= 24 + body_length
flags = buf.slice(pos + 24, 4).unpack1("N")
key = buf.slice(pos + 24 + 4, key_length)
value = buf.slice(pos + 24 + 4 + key_length, body_length - key_length - 4) if body_length - key_length - 4 > 0
pos = pos + 24 + body_length
begin
values[key] = [deserialize(value, flags), cas]
rescue DalliError
end
else
# not enough data yet, wait for more
break
end
end
@position = pos
Proposal¶
If unpack
and unpack1
had an offset:
parameter, it would allow this kind of code to extract the fields it needs without allocating and copying as much strings, e.g.:
flags = buf.slice(pos + 24, 4).unpack1("N")
could be:
buf.unpack1("N", offset: pos + 24)
Updated by znz (Kazuhiro NISHIYAMA) almost 3 years ago
You can use unpack1("@#{pos + 24}N")
.
Updated by byroot (Jean Boussier) almost 3 years ago
Ah, I didn't know about it, but then you just allocated a string and converted an integer to string, so it's even slower than the slice
pattern:
# frozen_string_literal: true
require 'benchmark/ips'
STRING = Random.bytes(200)
POS = 12
Benchmark.ips do |x|
x.report("no-offset") { STRING.unpack1("N") }
x.report("slice-offset") { STRING.slice(POS, 4).unpack1("N")}
x.report("unpack-offset") { STRING.unpack1("@#{POS}N") }
x.compare!
end
# Ruby 2.7.2
Warming up --------------------------------------
no-offset 1.016M i/100ms
slice-offset 532.173k i/100ms
unpack-offset 321.805k i/100ms
Calculating -------------------------------------
no-offset 10.090M (± 1.2%) i/s - 50.782M in 5.033549s
slice-offset 5.318M (± 2.1%) i/s - 26.609M in 5.005346s
unpack-offset 3.205M (± 1.8%) i/s - 16.090M in 5.021922s
Comparison:
no-offset: 10090269.9 i/s
slice-offset: 5318453.9 i/s - 1.90x (± 0.00) slower
unpack-offset: 3205017.9 i/s - 3.15x (± 0.00) slower
Based on this, an offset
parameter could make the current code almost 2x more efficient.
Updated by byroot (Jean Boussier) almost 3 years ago
I submitted a pull request for it, https://github.com/ruby/ruby/pull/4984.
Updated by matz (Yukihiro Matsumoto) almost 3 years ago
Sounds reasonable. Accepted.
Matz.
Updated by mame (Yusuke Endoh) almost 3 years ago
Just a confirmation: the offset is byte-oriented, not character-oriented, right? There are a format "u" which is UTF-8 coding, so the behavior should be explained clearly in the document.
Updated by nobu (Nobuyoshi Nakada) almost 3 years ago
As the RDoc of String#unpack
states:
# Decodes <i>str</i> (which may contain binary data) according to the
# format string, returning an array of each value extracted. The
Isn't it clear that it is counted as binary?
Updated by byroot (Jean Boussier) almost 3 years ago
Just a confirmation: the offset is byte-oriented, not character-oriented, right? There
Yes.
Updated by duerst (Martin Dürst) almost 3 years ago
mame (Yusuke Endoh) wrote in #note-5:
Just a confirmation: the offset is byte-oriented, not character-oriented, right? There are a format "u" which is UTF-8 coding, so the behavior should be explained clearly in the document.
This is not only a problem of "explain it in the document". In order for this offset to work well, there should be a way to know how many bytes an invocation of String#unpack consumes. In many cases, that's very easy to calculate from the format string, but in others, in particular for UTF-8, it's not easy.
Updated by byroot (Jean Boussier) almost 3 years ago
That argument will indeed be pretty much worthless if you use the U
format, but I don't really see it as a blocker. It is meant to help binary parsers, I don't see U
making sense for these.
As for the documentation, we indeed need to be clear that it's a byte offset.
Updated by byroot (Jean Boussier) almost 3 years ago
I extended the pull request to clearly document the offset
keyword and stress that it's a byte offset. Hopefully that clears that concern.
Updated by mame (Yusuke Endoh) almost 3 years ago
@byroot (Jean Boussier) Thank you for adding documentation. I agree with merging.
there should be a way to know how many bytes an invocation of String#unpack consumes.
In fact, some committers discussed this point at the dev-meeting. However, in many cases, it is trivial (or able to calculate) for a programmer how many bytes are consumed. Also, it looks difficult to provide the feature by just extending the current API design of String#unpack
. So, matz concluded that those who really wants the feaature should create another ticket with use case discussion and a concrete API proposal.
Updated by byroot (Jean Boussier) almost 3 years ago
Agreed. The goal is to avoid slicing anyway, and to slice you need to know how many bytes you consumed.
If there's no other objections I'll merge in a day or two.
Updated by byroot (Jean Boussier) almost 3 years ago
- Status changed from Open to Closed
Applied in changeset git|e5319dc9856298f38aa9cdc6ed55e39ad0e8e070.
pack.c: add an offset argument to unpack and unpack1
[Feature #18254]
This is useful to avoid repeteadly copying strings when parsing binary formats