Feature #21785: Add signed and unsigned LEB128 support to pack / unpack

Added by tenderlovemaking (Aaron Patterson) 2 days ago. Updated about 3 hours ago.

Status:
Open
Assignee:
-
Target version:
-
[ruby-core:124258]

Description

Hi,

I'd like to add signed and unsigned LEB128 support to the pack and unpack methods. LEB128 is a variable-length encoding scheme for integers. You can read the Wikipedia entry about it here: https://en.wikipedia.org/wiki/LEB128

LEB128 is used in DWARF, WebAssembly, MQTT, and Protobuf. I'm sure there are other formats, but these are the ones I'm familiar with.

I sent a pull request here: https://github.com/ruby/ruby/pull/15589

I'm proposing K for the unsigned version and k for the signed version. I just picked k because it was available; I'm open to other format strings.

Thanks for your consideration!

Updated by tenderlovemaking (Aaron Patterson) 2 days ago Actions #1 [ruby-core:124259]

Sorry, I probably should have put an example in the original post. Here is a sample of the usage:

irb(main):003> [0xFFF].pack("K")
=> "\xFF\x1F"
irb(main):004> [0xFFF].pack("K").unpack1("K")
=> 4095
irb(main):005> [-123].pack("k")
=> "\x85\x7F"
irb(main):006> [-123].pack("k").unpack1("k")
=> -123
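
For readers who want to see where those bytes come from, here is a minimal pure-Ruby sketch of both encodings. The helper names `uleb128_encode` and `sleb128_encode` are hypothetical and exist only for illustration; this is not the code from the pull request, just the usual LEB128 algorithm (7 payload bits per byte, high bit set while more bytes follow).

```ruby
# Hypothetical reference helpers for illustration only;
# not the implementation from the pull request.

# Encode a non-negative Integer as unsigned LEB128.
def uleb128_encode(value)
  raise ArgumentError, "value must be non-negative" if value.negative?

  bytes = []
  loop do
    byte = value & 0x7F               # take the low 7 bits
    value >>= 7
    byte |= 0x80 if value.positive?   # continuation bit: more groups follow
    bytes << byte
    break if value.zero?
  end
  bytes.pack("C*")
end

# Encode any Integer as signed LEB128.
def sleb128_encode(value)
  bytes = []
  loop do
    byte = value & 0x7F
    value >>= 7                       # arithmetic shift, preserves the sign
    sign_bit_set = (byte & 0x40) != 0
    # Stop once the remaining value is pure sign extension of this byte.
    done = (value.zero? && !sign_bit_set) || (value == -1 && sign_bit_set)
    byte |= 0x80 unless done
    bytes << byte
    break if done
  end
  bytes.pack("C*")
end

p uleb128_encode(0xFFF)  # same bytes as the pack("K") example above
p sleb128_encode(-123)   # same bytes as the pack("k") example above
```

The signed variant stops as soon as the remaining value is only sign extension, which is why -123 fits in two bytes.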

Updated by matz (Yukihiro Matsumoto) 2 days ago Actions #2 [ruby-core:124268]

I am positive about the addition of LEB128. But I don't really like K/k because it doesn't remind me of LEB128 at all (though I know we've used L, E, B already).

Given that the only case pairs not yet used are k, r, and y, either R (vaRiable length), or Y (next to W - BER) would be better than K/k.

Matz.

Updated by tenderlovemaking (Aaron Patterson) 1 day ago Actions #3 [ruby-core:124272]

matz (Yukihiro Matsumoto) wrote in #note-2:

I am positive about the addition of LEB128. But I don't really like K/k because it doesn't remind me of LEB128 at all (though I know we've used L, E, B already).

Given that the only case pairs not yet used are k, r, and y, either R (vaRiable length), or Y (next to W - BER) would be better than K/k.

Matz.

Thanks for the feedback. I've updated the patch to use R/r!

Updated by mame (Yusuke Endoh) 1 day ago Actions #4 [ruby-core:124287]

It's a shame unpack doesn't tell you how many bytes it read. You'd probably want an unpack variant that returns the final offset too, or a specifier that returns the current offset (like o?).

bytes = "\x01\x02\x03"
offset = 0
leb128_value1, offset = bytes.unpack("Ro", offset: offset) #=> 1
leb128_value2, offset = bytes.unpack("Ro", offset: offset) #=> 2
leb128_value3, offset = bytes.unpack("Ro", offset: offset) #=> 3

Updated by tenderlovemaking (Aaron Patterson) about 21 hours ago Actions #5 [ruby-core:124294]

mame (Yusuke Endoh) wrote in #note-4:

It's a shame unpack doesn't tell you how many bytes it read. You'd probably want an unpack variant that returns the final offset too, or a specifier that returns the current offset (like o?).

bytes = "\x01\x02\x03"
offset = 0
leb128_value1, offset = bytes.unpack("Ro", offset: offset) #=> 1
leb128_value2, offset = bytes.unpack("Ro", offset: offset) #=> 2
leb128_value3, offset = bytes.unpack("Ro", offset: offset) #=> 3

You could tell how many bytes you read based on the size of the leb128_value returned. But I agree, getting the information directly from unpack would be nice.

Updated by mame (Yusuke Endoh) about 18 hours ago Actions #6 [ruby-core:124298]

You could tell how many bytes you read based on the size of the leb128_value returned.

That approach is unreliable because LEB128 is redundant. For example, both "\x03" and "\x83\x00" are valid LEB128 encodings of the value 3.
See the note in the "Values - Integers" section of the Wasm spec:
https://webassembly.github.io/spec/core/binary/values.html#integers
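
To make the redundancy concrete, here is a small sketch of a decoder that reports how many bytes it consumed. The helper `uleb128_decode` is hypothetical (it is not part of the proposed patch or of any existing Ruby API); it returns the decoded value together with the offset of the next unread byte.

```ruby
# Hypothetical helper for illustration only: decode one unsigned LEB128
# value from `bytes` starting at `offset`, returning [value, next_offset].
def uleb128_decode(bytes, offset = 0)
  value = 0
  shift = 0
  loop do
    byte = bytes.getbyte(offset)
    raise ArgumentError, "truncated LEB128 value" if byte.nil?

    offset += 1
    value |= (byte & 0x7F) << shift   # append the next 7 payload bits
    shift += 7
    break if (byte & 0x80).zero?      # high bit clear: last byte of the value
  end
  [value, offset]
end

p uleb128_decode("\x03".b)      # value 3, one byte consumed
p uleb128_decode("\x83\x00".b)  # value 3, two bytes consumed
```

Both inputs decode to 3, but the second one consumes two bytes, so the number of bytes read cannot be recovered from the value alone.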

Updated by tenderlovemaking (Aaron Patterson) about 3 hours ago Actions #7 [ruby-core:124304]

mame (Yusuke Endoh) wrote in #note-6:

That approach is unreliable because LEB128 is redundant. For example, both "\x03" and "\x83\x00" are valid LEB128 encodings of the value 3.

Ah of course. I didn't think about that. 🤦‍♀️
