Feature #21785
Open
Add signed and unsigned LEB128 support to pack / unpack
Description
Hi,
I'd like to add signed and unsigned LEB128 support to the pack and unpack methods. LEB128 is a variable-length encoding scheme for integers. You can read the Wikipedia entry about it here: https://en.wikipedia.org/wiki/LEB128
LEB128 is used in DWARF, WebAssembly, MQTT, and Protobuf. I'm sure there are other formats, but these are the ones I'm familiar with.
I sent a pull request here: https://github.com/ruby/ruby/pull/15589
I'm proposing K for the unsigned version and k for the signed version. I just picked k because it was available; I'm open to other format strings.
Thanks for your consideration!
Updated by tenderlovemaking (Aaron Patterson) 2 days ago
Sorry, I probably should have put an example in the original post. Here is a sample of the usage:
irb(main):003> [0xFFF].pack("K")
=> "\xFF\x1F"
irb(main):004> [0xFFF].pack("K").unpack1("K")
=> 4095
irb(main):005> [-123].pack("k")
=> "\x85\x7F"
irb(main):006> [-123].pack("k").unpack1("k")
=> -123
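For anyone who wants to check the byte layout without building the patch, here is a rough pure-Ruby sketch of the encoding. The helper names encode_uleb128 and encode_sleb128 are made up for illustration and are not part of the pull request; the sketch only assumes the standard LEB128 rules (7 payload bits per byte, low group first, high bit set when more bytes follow).

```ruby
# Pure-Ruby sketch of LEB128 encoding. Helper names are hypothetical
# and are not part of the proposed pack/unpack change.

# Unsigned: emit 7 bits per byte, lowest group first.
# The high bit of each byte means "more bytes follow".
def encode_uleb128(value)
  raise ArgumentError, "value must be non-negative" if value.negative?

  bytes = []
  loop do
    byte = value & 0x7F
    value >>= 7
    if value.zero?
      bytes << byte
      break
    end
    bytes << (byte | 0x80)
  end
  bytes.pack("C*")
end

# Signed: same grouping, but stop once the remaining bits are
# nothing more than sign extension of the last emitted group.
def encode_sleb128(value)
  bytes = []
  loop do
    byte = value & 0x7F
    value >>= 7 # arithmetic shift on Ruby Integers
    done = (value.zero? && (byte & 0x40).zero?) ||
           (value == -1 && (byte & 0x40) != 0)
    bytes << (done ? byte : byte | 0x80)
    break if done
  end
  bytes.pack("C*")
end

p encode_uleb128(0xFFF).bytes # => [255, 31]
p encode_sleb128(-123).bytes  # => [133, 127]
```

These match the "\xFF\x1F" and "\x85\x7F" results in the irb session above.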
Updated by matz (Yukihiro Matsumoto) 2 days ago
I am positive about the addition of LEB128. But I don't really like K/k because it doesn't remind me of LEB128 at all (though I know we've used L, E, B already).
Given that the only case pairs not yet used are k, r, and y, either R (vaRiable length), or Y (next to W - BER) would be better than K/k.
Matz.
Updated by tenderlovemaking (Aaron Patterson) 1 day ago
matz (Yukihiro Matsumoto) wrote in #note-2:
I am positive about the addition of LEB128. But I don't really like K/k because it doesn't remind me of LEB128 at all (though I know we've used L, E, B already).
Given that the only case pairs not yet used are k, r, and y, either R (vaRiable length), or Y (next to W - BER) would be better than K/k.
Matz.
Thanks for the feedback. I've updated the patch to use R/r!
Updated by mame (Yusuke Endoh) 1 day ago
It's a shame unpack doesn't tell you how many bytes it read. You'd probably want an unpack variant that returns the final offset too, or a specifier that returns the current offset (like o?).
bytes = "\x01\x02\x03"
offset = 0
leb128_value1, offset = bytes.unpack("Ro", offset: offset) #=> 1
leb128_value2, offset = bytes.unpack("Ro", offset: offset) #=> 2
leb128_value3, offset = bytes.unpack("Ro", offset: offset) #=> 3
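Until something like that exists, the same loop can be approximated in plain Ruby. This is only a sketch: decode_uleb128 is a hypothetical helper, not an existing or proposed method, and it handles the unsigned case only.

```ruby
# Hypothetical helper: decode one unsigned LEB128 integer starting at
# `offset` and return [value, next_offset], so the caller can keep reading.
def decode_uleb128(bytes, offset = 0)
  value = 0
  shift = 0
  loop do
    byte = bytes.getbyte(offset)
    raise ArgumentError, "truncated LEB128 sequence" if byte.nil?

    offset += 1
    value |= (byte & 0x7F) << shift
    # A clear high bit marks the last byte of this integer.
    return [value, offset] if (byte & 0x80).zero?

    shift += 7
  end
end

bytes = "\x01\x02\x03".b
offset = 0
values = []
while offset < bytes.bytesize
  value, offset = decode_uleb128(bytes, offset)
  values << value
end
p values # => [1, 2, 3]
p offset # => 3
```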
Updated by tenderlovemaking (Aaron Patterson) about 21 hours ago
mame (Yusuke Endoh) wrote in #note-4:
It's a shame unpack doesn't tell you how many bytes it read. You'd probably want an unpack variant that returns the final offset too, or a specifier that returns the current offset (like o?).
bytes = "\x01\x02\x03"
offset = 0
leb128_value1, offset = bytes.unpack("Ro", offset: offset) #=> 1
leb128_value2, offset = bytes.unpack("Ro", offset: offset) #=> 2
leb128_value3, offset = bytes.unpack("Ro", offset: offset) #=> 3
You could tell how many bytes you read based on the size of the leb128_value returned. But I agree, getting the information directly from unpack would be nice.
Updated by mame (Yusuke Endoh) about 18 hours ago
You could tell how many bytes you read based on the size of the leb128_value returned.
That approach is unreliable because LEB128 is redundant. For example, both "\x03" and "\x83\x00" are valid LEB128 encodings of the value 3.
See the note in the "Values - Integers" section of the Wasm spec.
https://webassembly.github.io/spec/core/binary/values.html#integers
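To make the redundancy concrete, here is a small pure-Ruby sketch. Both helpers (read_uleb128 and minimal_uleb128_size) are hypothetical names used for illustration only. It decodes both encodings and compares the bytes actually consumed with the shortest possible encoding length for the decoded value.

```ruby
# Hypothetical helpers, not part of the proposed pack/unpack API.

# Decode one unsigned LEB128 integer from the start of `bytes`.
# Returns [value, bytes_consumed].
def read_uleb128(bytes)
  value = 0
  bytes.each_byte.with_index do |byte, index|
    value |= (byte & 0x7F) << (7 * index)
    return [value, index + 1] if (byte & 0x80).zero?
  end
  raise ArgumentError, "truncated LEB128 sequence"
end

# Length of the shortest unsigned LEB128 encoding of a non-negative value.
def minimal_uleb128_size(value)
  [(value.bit_length + 6) / 7, 1].max
end

p read_uleb128("\x03".b)     # => [3, 1]
p read_uleb128("\x83\x00".b) # => [3, 2]
p minimal_uleb128_size(3)    # => 1
```

The second input decodes to 3 but consumes two bytes, so a length inferred from the value alone would be off by one here.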
Updated by tenderlovemaking (Aaron Patterson) about 3 hours ago
mame (Yusuke Endoh) wrote in #note-6:
That approach is unreliable because LEB128 is redundant. For example, both "\x03" and "\x83\x00" are valid LEB128 encodings of the value 3.
Ah of course. I didn't think about that. 🤦♀️