Feature #6361

Bitwise string operations

Added by Martin Bosslet almost 3 years ago. Updated over 2 years ago.

[ruby-core:44630]
Status:Rejected
Priority:Normal
Assignee:-

Description

I know this has been discussed a lot in the past (and if there's still
an open issue for this, I apologize, I couldn't find one), for example
in [1]. While it is generally no problem to implement this on the fly,
I still find that built-in support would be a real improvement. There
are quite some use cases in cryptography where this would come in very
handy, but I'm sure there are lots of other areas, too.

While of course I understand the reasons that were given in the previous
threads that ultimately lead to rejection, I still would like to reopen
the discussion as I felt that in every thread so far the consensus was
that having bitwise string operations would indeed be quite valuable.

[1] http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/34586

bytestring001.tar.gz (3.39 KB) Martin Bosslet, 04/30/2012 11:52 AM

bytestring-002.tar.gz (3.47 KB) Martin Bosslet, 05/03/2012 09:07 AM

6361-proposal.pdf (25.4 KB) Joshua Ballanco, 06/30/2012 12:13 AM


Related issues

Related to Ruby trunk - Feature #6767: Utility method to get a duplicated string whose encoding ... Closed 07/22/2012

History

#1 Updated by Charles Nutter almost 3 years ago

+1 from me...

There are many cases where high-perf treatment of a homogeneous array would be useful in Ruby, as evidenced by libraries like NArray and friends. NArray has been proposed for inclusion in Ruby in the past, but I don't know why it never was.

Providing a few bitwise operations on String that operate against the raw bytes would fulfill one of the most common use cases, that of a byte[] and symmetric manipulations of all elements.

#2 Updated by Yui NARUSE almost 3 years ago

  • Status changed from Open to Feedback

Propose a detailed spec with use cases.
For example the behavior of the case "aa" ^ "bbbb"

#3 Updated by Thomas Sawyer almost 3 years ago

Would it be better to have a dedicated class, e.g. a Blob ?

#4 Updated by Alexey Muranov almost 3 years ago

+1 for Blob, strings are not the only data that one might wish to encrypt.

#5 Updated by Joshua Ballanco almost 3 years ago

+1 for Blob (or, my preferred name: Data). I've mentioned this in the past, but we can look at Obj-C's example: NSString => a string class with encoding and code-point/character-wise manipulation, NSData => a class to encapsulate a random array of bytes.

#6 Updated by Sylvain Daubert almost 3 years ago

+1 : I use them every day.

#7 Updated by Nobuyoshi Nakada almost 3 years ago

Then what kind of methods should Blob have?

And does it need to be built-in?

#8 Updated by Martin Bosslet almost 3 years ago

nobu (Nobuyoshi Nakada) wrote:

Then what kind of methods should Blob have?

And does it need to be built-in?

A real advantage of having it built-in could be
that this gives us the chance to fix #5741 at
the same time. I could imagine that we have two
kinds of "byte array" classes - one, mutable,
that shares COW semantics and all the other
optimizations with String, but with no notion of
encoding and a yet-to-be-defined interface.

And then a second class, which is basically the
immutable version of the first one. By sharing
only a reference we could ensure that the content
would not be proliferated and we could securely
erase its contents after use.

#9 Updated by Motohiro KOSAKI almost 3 years ago

On Fri, Apr 27, 2012 at 8:53 PM, MartinBosslet (Martin Bosslet)
Martin.Bosslet@googlemail.com wrote:

Issue #6361 has been updated by MartinBosslet (Martin Bosslet).

nobu (Nobuyoshi Nakada) wrote:

Then what kind of methods should Blob have?

And does it need to be built-in?

A real advantage of having it built-in could be
that this gives us the chance to fix #5741 at
the same time. I could imagine that we have two
kinds of "byte array" classes - one, mutable,
that shares COW semantics and all the other
optimizations with String, but with no notion of
encoding and a yet-to-be-defined interface.

And then a second class, which is basically the
immutable version of the first one. By sharing
only a reference we could ensure that the content
would not be proliferated and we could securely
erase its contents after use.

I don't dislike a bult-in idea. But you haven't show a detailed spec
and I don't think I clearly understand your idea. Can you spend a
few time for writing a spec? (probably rough a few line explanation
is enough)

#10 Updated by Nobuyoshi Nakada almost 3 years ago

Hi,

(12/04/28 9:53), MartinBosslet (Martin Bosslet) wrote:

A real advantage of having it built-in could be
that this gives us the chance to fix #5741 at
the same time.

It doesn't explain anything about why they need to be built-in. You
can just make them an external library altogether, can't you?

--
Nobu Nakada

#11 Updated by Martin Bosslet almost 3 years ago

kosaki (Motohiro KOSAKI) wrote:

I don't dislike a bult-in idea. But you haven't show a detailed spec
and I don't think I clearly understand your idea. Can you spend a
few time for writing a spec? (probably rough a few line explanation
is enough)

Yes, I thought about writing a proof of concept that could serve as
the blueprint for further discussion. I just wanted to make sure
first that there is a general interest and I also wanted to get your
ideas and suggestions first :) Nobuyoshi's point is a very valid one,
it helped me get a better view on the subject, and to rethink my own
position. (More on that below)

#12 Updated by Martin Bosslet almost 3 years ago

nobu (Nobuyoshi Nakada) wrote:

Hi,

(12/04/28 9:53), MartinBosslet (Martin Bosslet) wrote:

A real advantage of having it built-in could be
that this gives us the chance to fix #5741 at
the same time.

It doesn't explain anything about why they need to be built-in. You
can just make them an external library altogether, can't you?

I thought about this, and yes, you are absolutely right, there's
nothing I couldn't do in a separate C extension. The only remaining
argument I have for built-in support:

I would have liked to use the "secure memory erasure" feature within
the OpenSSL extension, which I couldn't do if it's a separate
library outside of the stdlib.

I'm not sure if that's enough. Especially since we could add the
functionality to OpenSSL directly. Being a "nice to have" feature
in itself probably would not justify adding built-in support, but
I'm fine with either decision.

#13 Updated by Joshua Ballanco almost 3 years ago

On Saturday, April 28, 2012 at 8:52 AM, KOSAKI Motohiro wrote:

On Fri, Apr 27, 2012 at 8:53 PM, MartinBosslet (Martin Bosslet)
wrote:

Issue #6361 has been updated by MartinBosslet (Martin Bosslet).

nobu (Nobuyoshi Nakada) wrote:

Then what kind of methods should Blob have?

And does it need to be built-in?

A real advantage of having it built-in could be
that this gives us the chance to fix #5741 at
the same time. I could imagine that we have two
kinds of "byte array" classes - one, mutable,
that shares COW semantics and all the other
optimizations with String, but with no notion of
encoding and a yet-to-be-defined interface.

And then a second class, which is basically the
immutable version of the first one. By sharing
only a reference we could ensure that the content
would not be proliferated and we could securely
erase its contents after use.

I don't dislike a bult-in idea. But you haven't show a detailed spec
and I don't think I clearly understand your idea. Can you spend a
few time for writing a spec? (probably rough a few line explanation
is enough)

If I may intrude for a moment… I think the advantage to having a built in Data/Blob library would be that it could be used in all places where a data class is more appropriate than a string. For example, the Socket library currently returns Strings for data read in from a socket. I think a Data class is more appropriate here since the socket itself does not contain encoding information (i.e. either an arbitrary default encoding needs to be set, a heuristic can be used to guess the encoding, or the encoding is set by a previously agreed up convention; but you cannot ask a socket for its encoding).

As for a spec, I think it should be kept relatively simple. The one interesting optimization from NSData that might be useful is the option of copying bytes on instantiation. Copying is the default, but it is also possible to create a Data object that merely points at the storage of another live object and allows byte-wise manipulation. This is particularly interesting for the case of strings, since I would guess that String and Data would have identical storage layout, allowing one to optimize the case of creating a Data from a String with no copying.

A quick attempt at a spec:


Data.new #=> New, dynamically resizable container to store some bytes
Data.new('Test') #=> Can be created from any object that responds to #bytes with an enumerator
Data.new('Hello', copy_bytes: false) #=> Creates the Data from the String by merely pointing to the same storage

Data.open('./foo/test.txt') #=> Create a Data object from a File
Data.open('./bar/test.txt', copy_bytes: false) #=> Same as open above, but manipulates IO#pos for access
Data.write('./baz/test.txt') #=> Writes the bytes to disk.

d = Data.new(a_string)
d[2] #=> Returns the third byte, same as a_string.bytes.to_a[2]
d[2] = 42 #=> Same as a_string.setbyte(2, 42)
d.each #=> Equivalent to a_string.each_byte
d.length #=> Number of bytes currently being stored
d.slice(2, 4) #=> Similar to String#slice
d.slice(2, 4, copy_bytes: false) #=> New data object from slice shares storage with the original
d << other_data #=> Appends bytes from other_data
d.to_s #=> Returns a string using the default internal encoding
d.string_with_encoding('UTF-16') #=> Returns a string using the encoding passed


I know it seems like this class is just wrapping String and always defaulting to byte-wise operations, but it's more fundamental than that. Because there is no encoding on the bytes, there will never be an encoding error when working with them. This could be extremely useful for applications that combine bytes from multiple sources (e.g. Socket data + a file on disk + immediate strings in code) that could potentially have different encodings. By operating on bytes, you can defer the encoding checks until later, if at all.

  • Josh

#14 Updated by Martin Bosslet almost 3 years ago

  • File bytestring001.tar.gz added

If I may intrude for a moment…

No such thing - thank you for your ideas!

I think the advantage to having a built in Data/Blob library would be that it could be used in all places where a data class is more appropriate than a string. For example, the Socket library currently returns Strings for data read in from a socket. I think a Data class is more appropriate here since the socket itself does not contain encoding information (i.e. either an arbitrary default encoding needs to be set, a heuristic can be used to guess the encoding, or the encoding is set by a previously agreed up convention; but you cannot ask a socket for its encoding).

That's a good point. IO would definitely benefit from this
feature, Strings would only be needed when reading line by
line. Encoding would not be a problem anymore when reading
raw bytes.

A quick attempt at a spec:

I picked up Joshua's ideas and implemented a fully functional
ByteString class (simulating the desired behavior in Ruby for
now). I like ByteString better than Data, but that's me, I
would be fine with whatever name the majority likes best.

The class features a reduced String interface like Joshua
proposed already, plus bit-level operations and a
#to_hex/from_hex (I need those all the time).

In addition, there is ByteString::Immutable, which tries to
illustrate the behavior I had in mind for a class that would
allow secure in-memory erasure of its contents. It can only
be created from an IO directly (we can't use a String because
this would already leak the sensitive information).

By default, ByteString copies the contents it is given, but
if we want to include referencing as described by Joshua,
this could be integrated in Immutable, and the security
aspects could be implemented in yet another class, that
borrows most of its functionality from Immutable.

To demonstrate the behavior and as a basis for further
discussion, I included specs for both classes, I find those
to be the easiest way for discussing an interface.

The code can also be found at [1], in case somebody would
like to hack on it and improve it, I would update this
thread in that case.

[1] https://github.com/emboss/bytestring

#15 Updated by Martin Bosslet almost 3 years ago

  • File deleted (bytestring001.tar.gz)

#17 Updated by Thomas Sawyer almost 3 years ago

I'm not so sure using "String" in the name is a good idea.

The reason I suggested Blob is b/c that's what it is often called: http://en.wikipedia.org/wiki/Binary_large_object

#18 Updated by Martin Bosslet almost 3 years ago

trans (Thomas Sawyer) wrote:

I'm not so sure using "String" in the name is a good idea.

The reason I suggested Blob is b/c that's what it is often called: http://en.wikipedia.org/wiki/Binary_large_object

Hmm, I guess names are a very subtle topic. I stole from [1],
but I see your point why "String" could be confusing. I don't
really care about the name, I'm fine with anything that's
reasonable enough :)

[1] http://hackage.haskell.org/packages/archive/bytestring/0.9.2.1/doc/html/Data-ByteString.html

#19 Updated by Martin Bosslet almost 3 years ago

Added the possibility to specify an encoding when converting
to a String, forgot about that in the first version.

#20 Updated by Martin Dürst almost 3 years ago

On 2012/04/30 1:50, Joshua Ballanco wrote:

I know it seems like this class is just wrapping String and always defaulting to byte-wise operations, but it's more fundamental than that. Because there is no encoding on the bytes, there will never be an encoding error when working with them. This could be extremely useful for applications that combine bytes from multiple sources (e.g. Socket data + a file on disk + immediate strings in code) that could potentially have different encodings. By operating on bytes, you can defer the encoding checks until later, if at all.

I'm not saying I'm totally against this, but "extremely useful" could
also mean "too useful". There are clearly cases where one needs to put
things together at the byte level. But there are also quite some cases
that seem to "just work" when using byte-wise operations, at least as
long as nothing else but US-ASCII gets used. Things then blow up
terribly once some other characters get into the mix.

Actually, the binary/ASCII-8bit encoding is very close to a Blob. It was
mostly Akira Tanaka who didn't want to distinguish between "true" binary
and ASCII-8bit, because that would have made the use of regular
expressions with binary impossible or convoluted.

Despite the title of this issue, I didn't see any *bit*wise operations
(e.g. bitwise and/or/xor/not) proposed. Were you just taking them for
granted? What about adding these to String, maybe limiting them to
binary/ASCII-8bit?

Regards, Martin.

#21 Updated by Joshua Ballanco almost 3 years ago

On Thursday, May 3, 2012 at 9:16 AM, "Martin J. Dürst" wrote:

On 2012/04/30 1:50, Joshua Ballanco wrote:

I know it seems like this class is just wrapping String and always defaulting to byte-wise operations, but it's more fundamental than that. Because there is no encoding on the bytes, there will never be an encoding error when working with them. This could be extremely useful for applications that combine bytes from multiple sources (e.g. Socket data + a file on disk + immediate strings in code) that could potentially have different encodings. By operating on bytes, you can defer the encoding checks until later, if at all.

I'm not saying I'm totally against this, but "extremely useful" could

also mean "too useful". There are clearly cases where one needs to put

things together at the byte level. But there are also quite some cases

that seem to "just work" when using byte-wise operations, at least as

long as nothing else but US-ASCII gets used. Things then blow up

terribly once some other characters get into the mix.

So, as an addendum to the spec, what about adding a flag when doing a string conversion:

 d.string_with_encoding('UTF-8', reject_if_invalid: true)

So that we could ensure that the return value is always either nil or a string with valid encoding.

Actually, the binary/ASCII-8bit encoding is very close to a Blob. It was
mostly Akira Tanaka who didn't want to distinguish between "true" binary

and ASCII-8bit, because that would have made the use of regular

expressions with binary impossible or convoluted.

My problem with String and ASCII-8BIT/BINARY encoding currently is that you can't just set a string's encoding to binary and forget about encodings. You will still run into issues working with binary data using Ruby 1.9 strings. I demonstrated the issue here: http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/40269 (where I, consequently, also made a plea for a Data/Blob type).

Despite the title of this issue, I didn't see any *bit*wise operations
(e.g. bitwise and/or/xor/not) proposed. Were you just taking them for

granted? What about adding these to String, maybe limiting them to

binary/ASCII-8bit?

I was taking bit-wise operations for granted. Ideally, a Data/Blob type would just represent N groupings of 8 1s and/or 0s, with byte-wise access and bit-wise manipulation. i.e. Less structured than an Array, less restrictive than a String. Just data.

#22 Updated by George Koehler almost 3 years ago

=begin
A new BinaryString (or Blob) class would entail several changes. For example, Array#pack and IO#read(n) would need to return BinaryString, not String. I prefer to keep String for binary strings.

The main reason, to perform bitwise operations on a String, is to use this String as an array of bits. The purpose of each bitwise operation is to clear or set some bits in this array. For example, a sieve of Erathosthenes might clear a bit in a String to show that a number is not prime.

String has no bitwise operations. So, the code to clear a bit is long, not simple.

sieve = "\xAC(\x8A\xA2("
# 25 is not prime. Clear bit 0x02 of byte 3.
sieve.setbyte(3, sieve.getbyte(3) & ~0x02)

A simpler way, with current Ruby, is to unpack this String into an Array of Integers. The code can then perform bitwise operations with these Integers.

sieve = "\xAC(\x8A\xA2(".unpack("C*")
# 25 is not prime. Clear bit 0x02 of byte 3.
sieve[3] &= ~0x02

The sieve in [1] uses an Array of 16-bit Fixnums. This works because Fixnum has bitwise operations. A sieve at [2] uses an Array of true and false. This also works, though it uses more memory.

[1] https://github.com/ruby/ruby/blob/trunk/lib/prime.rb
[2] http://rosettacode.org/wiki/Sieve_of_Eratosthenes#Ruby

If String would have bitwise operations, a sieve of Eratosthenes might use them. If String#& and String#~ would exist, then they can clear a bit.

sieve = "\xAC(\x8A\xA2("
sieve[3] &= ~"\x02"

With this design, String#& and String#~ never modify the original string. This design forgot that Strings are mutable. String#& must allocate a new string, before String#[]= modifies the original string. A better design might provide destructive methods that use bitwise operations to modify a slice of the original string.
=end

#23 Updated by Joshua Ballanco almost 3 years ago

Just throwing this out there, but a new class could be introduced first, then used as a replacement for Pack/IO/etc. in a later version (to allay breaking change concerns).

I understand the desire to simply work on an array of bits, but I think your sieve example is a little too simplistic. Are you never going to pass this string holding bits to another method? Are you never going to concatenate it with another string? If the answer is yes, then your solution would work. However, if there is ever the chance that someone might call #encode or #<< on your string, then the current Ruby string implementation is insufficient for working with bits.

#24 Updated by Nobuyoshi Nakada almost 3 years ago

=begin
: kernigh (George Koehler) wrote:

The main reason, to perform bitwise operations on a String, is to use this String as an array of bits. The purpose of each bitwise operation is to clear or set some bits in this array. For example, a sieve of Erathosthenes might clear a bit in a String to show that a number is not prime.

What you want seems (({BitArray})), not a kind of char string.

: kernigh (George Koehler) wrote:

If String would have bitwise operations, a sieve of Eratosthenes might use them. If String#& and String#~ would exist, then they can clear a bit.

sieve = "\xAC(\x8A\xA2("
sieve[3] &= ~"\x02"

Isn't (({BitArray})) clearer?

sieve = BitArray("\xAC(\x8A\xA2(")
sieve[26] = false
# or
sieve.clear(26)

=end

#25 Updated by Martin Bosslet almost 3 years ago

nobu (Nobuyoshi Nakada) wrote:

What you want seems (({BitArray})), not a kind of char string.

Thought the same, sounds much like this[1].

I think having both bit and byte operations in the same class at the same time would become a mess real soon. But apart from the "resolution" (bit, byte, multi-byte) the operations one would wish to perform on the array/vector are pretty much the same everywhere. So I had this idea: what if we made the resolution a parameter on initialization, and this would determine the further behavior? Something like

bits = ByteString.new(2)
bits[7] = 1 #Sets the 8th bit to 1, argument may only be in the range 0..1 (= [0;21 - 1])
bits[2] = 1 #Sets the 3rd bit to 1

bytes = ByteString.new(8)
bytes[0] = 0xcc #Sets the first byte to 0xcc, arguments are in the range of 0..255 (=[0;28-1])

uint32s = ByteString.new(32)
uint32s[0] = 65536 #Sets the first word (32bit) to 65536, arguments are in the range [0;232-1])

...

The initialization parameter would be a power of two that indicates the "width" in bits of a single unit of data that you wish to manipulate.

The behavior of "xor", "and", "or", "not" would follow naturally and would be implied by the underlying "resolution". This allows everyone to have their version of what fits best in their current domain, much like pack/unpack does.

[1] http://docs.oracle.com/javase/6/docs/api/java/util/BitSet.html

#26 Updated by Martin Bosslet almost 3 years ago

duerst (Martin Dürst) wrote:

Despite the title of this issue, I didn't see any *bit*wise operations
(e.g. bitwise and/or/xor/not) proposed. Were you just taking them for
granted? What about adding these to String, maybe limiting them to
binary/ASCII-8bit?

Sorry for the confusion, you are right, I should have actually written "bytewise"
in the title for my initial proposal.

There is a risk of exaggerating things, but on the other hand I think
a feature like this would simplify implementing low-level binary protocols
a lot by reducing the amount of bit shuffling which obviously is a common
source of errors, for example I really like what Erlang allows you to do [1].

I'm not against adding the functionality to String, it could just be that
a separate class might help to separate concerns more clearly. Adding more
methods to String would would mean more work when duck-typing a String etc.

[1] http://www.erlang.org/documentation/doc-5.6/doc/programming_examples/bit_syntax.html

#27 Updated by Martin Bosslet almost 3 years ago

kernigh (George Koehler) wrote:

=begin
With this design, String#& and String#~ never modify the original string. This design forgot that Strings are mutable. String#& must allocate a new string, before String#[]= modifies the original string. A better design might provide destructive methods that use bitwise operations to modify a slice of the original string.
=end

Good point, I like the idea of having an alternative destructive version, should boost performance a lot.
I think you actually need both because the "secure" version, ByteString::Immutable, needs to be immutable.
But in the general case both versions have their use. I'll add it to the proposal.

#28 Updated by Joshua Ballanco over 2 years ago

Second feature request. I picked this issue for this feature request because it contains the most recent discussion on this topic. Please let me know if a new, more specific, feature request would be more appropriate.

#29 Updated by Yusuke Endoh over 2 years ago

Your slide is received. Thank you!

A new class, not bitwise string operations, is now being discussed in this thread?
(I don't read the long discussion ;-)
If so, it may be good to get a fresh start by opening another ticket.

Anyway, your slide (and proposal) looks attractive and will be presented at the 7/21 meeting. Thank you again!

Yusuke Endoh mame@tsg.ne.jp

#30 Updated by Yusuke Endoh over 2 years ago

  • Status changed from Feedback to Rejected

Martin Bosslet and Joshua Ballanco,

Sorry but this proposal was rejected at the developer meeting (7/21).

Matz explicitly said he will not accept another class than String.
He hates Python way.

Two additional notes:

  • The original request (bitwise string operation) was not considered
    because it is not essential for ByteString; if you want, please file
    a new ticket for the specific feature with the detailed spec including
    corner cases (e.g., ).

  • Matz wants to save the motivating example in the slide (below)
    in other approach than the proposal:

    "\xff".force_encoding("BINARY") << "\xff"
    #=> incompatible character encodings:
    ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)

    We discussed three approaches:

  1. adding String#b that returns BINARY string:

    "\xff".force_encoding("BINARY") << "\xff".b

  2. adding a new `percent' syntax for binary string:

    "\xff".force_encoding("BINARY") << %b{\xff}

  3. suppressing the exception in some way

    Some people (including Matz) liked (3), but akr showed strong objection
    because it causes breaking a character (Sorry I didn't understand his
    opinion precisely). (2) is difficult because of syntax extension.
    So we are now keen on (1).
    This is discussed in #6767 (Sorry, in Japanese!)

Yusuke Endoh mame@tsg.ne.jp

#31 Updated by Martin Bosslet over 2 years ago

mame (Yusuke Endoh) wrote:

Martin Bosslet and Joshua Ballanco,

Sorry but this proposal was rejected at the developer meeting (7/21).

Matz explicitly said he will not accept another class than String.
He hates Python way.

No problem! It's very specific and I understand well that it certainly
adds complexity that is not needed often enough. I still believe that
having something like ByteString would boost low-level IO performance a
lot - I'll take the gem route then, I have some ideas there. Maybe I'll
get the chance to discuss some of them with Matz at LSRC next week :)

Two additional notes:

  • The original request (bitwise string operation) was not considered because it is not essential for ByteString; if you want, please file a new ticket for the specific feature with the detailed spec including corner cases (e.g., ).

OK, I'll open another issue for this.

Also available in: Atom PDF