Project

General

Profile

Actions

Feature #18254

closed

Add an `offset` parameter to String#unpack and String#unpack1

Added by byroot (Jean Boussier) about 2 months ago. Updated about 1 month ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Target version:
-
[ruby-core:105660]

Description

When working with binary protocols it's common to have to first unpack some kind of header or type prefix, and then based on that unpack another part of the string.

For instance here's a code snippet from Dalli, the most common Memcached client:

while buf.bytesize - pos >= 24
  header = buf.slice(pos, 24)
  (key_length, _, body_length, cas) = header.unpack(KV_HEADER)

  if key_length == 0
    # all done!
    @multi_buffer = nil
    @position = nil
    @inprogress = false
    break

  elsif buf.bytesize - pos >= 24 + body_length
    flags = buf.slice(pos + 24, 4).unpack1("N")
    key = buf.slice(pos + 24 + 4, key_length)
    value = buf.slice(pos + 24 + 4 + key_length, body_length - key_length - 4) if body_length - key_length - 4 > 0

    pos = pos + 24 + body_length

    begin
      values[key] = [deserialize(value, flags), cas]
    rescue DalliError
    end

  else
    # not enough data yet, wait for more
    break
  end
end
@position = pos

Proposal

If unpack and unpack1 had an offset: parameter, it would allow this kind of code to extract the fields it needs without allocating and copying as much strings, e.g.:

flags = buf.slice(pos + 24, 4).unpack1("N")

could be:

buf.unpack1("N", offset: pos + 24)

Updated by znz (Kazuhiro NISHIYAMA) about 2 months ago

You can use unpack1("@#{pos + 24}N").

Updated by byroot (Jean Boussier) about 2 months ago

Ah, I didn't know about it, but then you just allocated a string and converted an integer to string, so it's even slower than the slice pattern:

# frozen_string_literal: true
require 'benchmark/ips'

STRING = Random.bytes(200)
POS = 12
Benchmark.ips do |x|
  x.report("no-offset") { STRING.unpack1("N") }
  x.report("slice-offset") { STRING.slice(POS, 4).unpack1("N")}
  x.report("unpack-offset") { STRING.unpack1("@#{POS}N") }
  x.compare!
end
# Ruby 2.7.2
Warming up --------------------------------------
           no-offset     1.016M i/100ms
        slice-offset   532.173k i/100ms
       unpack-offset   321.805k i/100ms
Calculating -------------------------------------
           no-offset     10.090M (± 1.2%) i/s -     50.782M in   5.033549s
        slice-offset      5.318M (± 2.1%) i/s -     26.609M in   5.005346s
       unpack-offset      3.205M (± 1.8%) i/s -     16.090M in   5.021922s

Comparison:
           no-offset: 10090269.9 i/s
        slice-offset:  5318453.9 i/s - 1.90x  (± 0.00) slower
       unpack-offset:  3205017.9 i/s - 3.15x  (± 0.00) slower

Based on this, an offset parameter could make the current code almost 2x more efficient.

Updated by matz (Yukihiro Matsumoto) about 1 month ago

Sounds reasonable. Accepted.

Matz.

Updated by mame (Yusuke Endoh) about 1 month ago

Just a confirmation: the offset is byte-oriented, not character-oriented, right? There are a format "u" which is UTF-8 coding, so the behavior should be explained clearly in the document.

Updated by nobu (Nobuyoshi Nakada) about 1 month ago

As the RDoc of String#unpack states:

  # Decodes <i>str</i> (which may contain binary data) according to the
  # format string, returning an array of each value extracted. The

Isn't it clear that it is counted as binary?

Updated by byroot (Jean Boussier) about 1 month ago

Just a confirmation: the offset is byte-oriented, not character-oriented, right? There

Yes.

Updated by duerst (Martin Dürst) about 1 month ago

mame (Yusuke Endoh) wrote in #note-5:

Just a confirmation: the offset is byte-oriented, not character-oriented, right? There are a format "u" which is UTF-8 coding, so the behavior should be explained clearly in the document.

This is not only a problem of "explain it in the document". In order for this offset to work well, there should be a way to know how many bytes an invocation of String#unpack consumes. In many cases, that's very easy to calculate from the format string, but in others, in particular for UTF-8, it's not easy.

Updated by byroot (Jean Boussier) about 1 month ago

That argument will indeed be pretty much worthless if you use the U format, but I don't really see it as a blocker. It is meant to help binary parsers, I don't see U making sense for these.

As for the documentation, we indeed need to be clear that it's a byte offset.

Updated by byroot (Jean Boussier) about 1 month ago

I extended the pull request to clearly document the offset keyword and stress that it's a byte offset. Hopefully that clears that concern.

Updated by mame (Yusuke Endoh) about 1 month ago

byroot (Jean Boussier) Thank you for adding documentation. I agree with merging.

there should be a way to know how many bytes an invocation of String#unpack consumes.

In fact, some committers discussed this point at the dev-meeting. However, in many cases, it is trivial (or able to calculate) for a programmer how many bytes are consumed. Also, it looks difficult to provide the feature by just extending the current API design of String#unpack. So, matz concluded that those who really wants the feaature should create another ticket with use case discussion and a concrete API proposal.

Updated by byroot (Jean Boussier) about 1 month ago

Agreed. The goal is to avoid slicing anyway, and to slice you need to know how many bytes you consumed.

If there's no other objections I'll merge in a day or two.

Actions #13

Updated by byroot (Jean Boussier) about 1 month ago

  • Status changed from Open to Closed

Applied in changeset git|e5319dc9856298f38aa9cdc6ed55e39ad0e8e070.


pack.c: add an offset argument to unpack and unpack1

[Feature #18254]

This is useful to avoid repeteadly copying strings when parsing binary formats

Actions

Also available in: Atom PDF