Feature #17790

Have a way to clear a String without resetting its capacity

Added by byroot (Jean Boussier) about 1 month ago. Updated 25 days ago.

Status:
Open
Priority:
Normal
Assignee:
-
Target version:
-
[ruby-core:103342]

Description

In some tight loops it can be useful to reuse a buffer string. For instance:

buffer = String.new(encoding: Encoding::BINARY, capacity: 1024)

10.times do
  build_next_packet(buffer)
  udp_socket.send(buffer)
  buffer.clear
end

Currently Array#clear preserves the Array's capacity, but String#clear doesn't:

>> puts ObjectSpace.dump(Array.new(20).clear)
{"address":"0x7fd3260a1558", "type":"ARRAY", "class":"0x7fd3230972e0", "length":0, "memsize":200, "flags":{"wb_protected":true}}
>> puts ObjectSpace.dump(String.new(encoding: Encoding::BINARY, capacity: 1024).clear)
{"address":"0x7fd322a8a320", "type":"STRING", "class":"0x7fd3230b75b8", "embedded":true, "bytesize":0, "value":"", "memsize":40, "flags":{"wb_protected":true}}

It would be useful if String#clear didn't free the allocated memory. If changing it is a backward compatibility concern, then maybe another method could make sense?
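The comparison in the dumps above can be reproduced with a short script. This is a minimal sketch: `memsize_around_clear` is a hypothetical helper name, and `ObjectSpace.memsize_of` is a CRuby-specific hint, so the exact numbers depend on the Ruby version and platform.

```ruby
require 'objspace'

# Hypothetical helper: returns [memsize_before, memsize_after] around #clear.
# The values are implementation-specific hints, not exact allocation sizes.
def memsize_around_clear(obj)
  before = ObjectSpace.memsize_of(obj)
  obj.clear
  [before, ObjectSpace.memsize_of(obj)]
end

string_sizes = memsize_around_clear(String.new(encoding: Encoding::BINARY, capacity: 1024))
array_sizes  = memsize_around_clear(Array.new(20))

puts "String: #{string_sizes.inspect}"
puts "Array:  #{array_sizes.inspect}"
```

If the String's second number is much smaller than its first, the buffer was released by #clear.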

Updated by marcandre (Marc-Andre Lafortune) about 1 month ago

Looks good. I doubt very much that this would be a compatibility concern.

Updated by Eregon (Benoit Daloze) about 1 month ago

I think that some people and libraries might expect that the #clear method releases the allocated memory.
This might be useful when e.g. reusing a String as a large buffer and the new usage might need less memory.
Not saying that's a good pattern, because IMHO it would be better to allocate a new String, but I'd guess it's used in some cases.

In general I think it is surprising that after #clear the object might "leak" a significant amount of memory, not observable from the typical Ruby methods on that collection.

#clear feels a bit similar to #close to me.

Updated by dylants (Dylan Thacker-Smith) about 1 month ago

What makes sense probably depends on how long lived the String is and whether there is an upper-bound to how much needs to be stored in it.

For instance, there may be a rare iteration of a loop that adds a lot to the String, which might be excessive to hold onto for most iterations. As such, we may want to shrink the String back to the capacity we expect most iterations to use, such as the initial capacity.

It would be nice to have more control over the capacity of collections such as Array or String. If we know exactly how much memory is needed, then it would be useful to have shrink(capacity = bytesize) and reserve(capacity) for this purpose. It is also common to know that at least a specific amount of memory needs to be reserved, but not exactly how much is needed, so providing a capacity method gives more control over how to expand memory (e.g. double the capacity until it is at least the minimum amount needed, then call reserve with that expanded capacity). This would provide the primitives needed to avoid unnecessary reallocations, on top of which convenience methods can always be built.
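The doubling policy mentioned above can be sketched in plain Ruby. `grown_capacity` is a hypothetical helper, and since String#capacity and String#reserve are only proposals in this thread, the sketch just computes the number that would be passed to such a method.

```ruby
# Hypothetical helper: doubles the current capacity until it covers `needed`
# bytes and returns the result. It does not touch any real String buffer.
def grown_capacity(current, needed)
  capacity = [current, 1].max          # avoid looping forever on 0
  capacity *= 2 while capacity < needed
  capacity
end

puts grown_capacity(1024, 3000)  # 1024 -> 2048 -> 4096
puts grown_capacity(1024, 500)   # already large enough, stays 1024
```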

Updated by dylants (Dylan Thacker-Smith) about 1 month ago

If we want clear to shrink memory by default, a shrink: true keyword argument could be added so the user could override this default with clear(shrink: false). This would make the change less risky, since it wouldn't change the behaviour of existing code.

Updated by byroot (Jean Boussier) about 1 month ago

> so providing a capacity method gives more control over how to expand memory

Agreed. Without also exposing the capacity, my proposed change would be a big footgun.

Maybe String#capacity and String#capacity= would make sense? But then there's the question of the behavior if you set the capacity to lower than the size. Should it truncate? (this could corrupt UTF-8 for instance) or should it raise?
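To make the corruption concern concrete, here is a small sketch of what a byte-level truncation does to a UTF-8 string. `byte_truncate` is a hypothetical helper built on the existing String#byteslice; it only illustrates the risk and is not the proposed capacity= behavior.

```ruby
# Hypothetical helper: keeps only the first `limit` bytes, ignoring
# character boundaries (which is what a naive truncation would do).
def byte_truncate(str, limit)
  str.byteslice(0, limit)
end

text = "h\u00E9llo"              # "é" takes two bytes in UTF-8
cut  = byte_truncate(text, 2)    # keeps "h" plus the first byte of "é"

puts text.bytesize               # 6
puts cut.valid_encoding?         # false
```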

Additionally I think Array and Hash should expose similar ways of querying and reserving capacity.

Updated by normalperson (Eric Wong) about 1 month ago

jean.boussier@gmail.com wrote:

> > so providing a capacity method gives more control over how to expand memory
>
> Agreed. Without also exposing the capacity, my proposed change would be a big footgun.

Yes, rb_str_resize(str, 0) is commonly used to work around the lack of
escape analysis inside the core VM and some C exts. I think
it's reasonable for Rubyists to use String#clear for the same
purpose.

> Maybe String#capacity and String#capacity= would make sense? But then there's the question of the behavior if you set the capacity to lower than the size. Should it truncate? (this could corrupt UTF-8 for instance) or should it raise?

Yes, but I don't know what it should do for corruption. It
would also be useful for IO#read-like methods if/when that
supports destination buffer offsets.

> Additionally I think Array and Hash should expose similar ways of querying and reserving capacity.

Probably, yes. It seems a bit low-level, but I've been favoring
"semi-automatic" memory management since we probably can't have
escape analysis due to the C API.

Updated by Eregon (Benoit Daloze) about 1 month ago

byroot (Jean Boussier) wrote in #note-6:

> Maybe String#capacity and String#capacity= would make sense? But then there's the question of the behavior if you set the capacity to lower than the size. Should it truncate? (this could corrupt UTF-8 for instance) or should it raise?

I would say definitely not truncate.
So either raise an error, or do nothing, since the capacity is kind of a hint.

My feeling is handling the capacity in Ruby code feels wrong and like C++ code.
The example snippet above can just allocate a new String per call to build_next_packet and I doubt that would affect performance much:

10.times do
  buffer = build_next_packet # build_next_packet can use String.new(capacity: 1024), it probably knows better about it anyway
  udp_socket.send(buffer)
end

A bit more GC pressure, but I think there would be so many other allocations in build_next_packet anyway that this wouldn't matter much.

IMHO we should really have a separate Buffer class and String class here.
For instance if the String class uses a rope representation (e.g. on TruffleRuby) a capacity doesn't make much sense.

Updated by Eregon (Benoit Daloze) about 1 month ago

I think clear(shrink: true/false) would be fine to add.

I'm not sure if it's really needed in practice though.

Updated by byroot (Jean Boussier) about 1 month ago

> My feeling is handling the capacity in Ruby code feels wrong and like C++ code.

This is really meant for the few low level places where it matters. For context this came up when trying to optimize a StatsD client, which is quite a hotspot.

More generally there are cases where you know you'll return a large String/Array/Hash, and you know the size in advance, so it can make sense to pre-reserve capacity rather than having it resized a dozen times. Amusingly enough, the C API allows creating an array or hash with a specific capacity.
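For strings, this kind of pre-reservation is already possible from Ruby with String.new(capacity:). A minimal sketch, where `join_lines` is a hypothetical helper that computes the final size up front:

```ruby
# Hypothetical helper: the total size is known before building the result,
# so the buffer is requested once instead of growing during the appends.
def join_lines(lines)
  total = lines.sum { |line| line.bytesize + 1 }   # +1 for each "\n"
  out = String.new(encoding: Encoding::UTF_8, capacity: total)
  lines.each { |line| out << line << "\n" }
  out
end

puts join_lines(["metric.a:1|c", "metric.b:2|c"])
```

The capacity is only a hint to the implementation; the resulting string is the same with or without it.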

> IMHO we should really have a separate Buffer class and String class here.

Possibly yes. StringIO is kind of meant to be that buffer class, but it's not always easier to use, and often is slower than using a string.
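For comparison, this is roughly what reusing a StringIO as the buffer looks like. It is only a sketch: `build_packets` is a hypothetical helper, and whether the underlying String keeps its capacity across `truncate(0)` is not something this example shows or relies on.

```ruby
require 'stringio'

# Hypothetical helper: reuses one StringIO for every message.
def build_packets(messages)
  io = StringIO.new(String.new(encoding: Encoding::BINARY, capacity: 1024))
  messages.map do |message|
    io.truncate(0)   # drop the previous contents
    io.rewind        # truncate does not move the write position
    io.write(message)
    io.string.dup    # copy out, since io.string is reused on the next pass
  end
end

p build_packets(["first", "second"])
```

The extra `rewind` is the kind of detail that makes StringIO less convenient than a plain String here: without it, the next write would start at the old position.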

> I think clear(shrink: true/false) would be fine to add.

I think it would be good, but it would also require a String#resize(capacity).

So the pattern would be:

buffer = String.new(encoding: Encoding::BINARY, capacity: 1024)
loop do
  # do your thing
  buffer.clear(shrink: false)
  buffer.resize(1024)
end

Updated by Dan0042 (Daniel DeLorme) about 1 month ago

What about buffer.clear(capacity: 1024)
Or maybe even buffer.clear(capacity: 1024..8192)
I think that's more straightforward than separate clear and resize operations.
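One possible reading of the range form is that the capacity left after clearing is the current capacity clamped into the range. That interpretation is an assumption, not something stated in the suggestion, and `capacity_after_clear` is a hypothetical helper that only does the arithmetic:

```ruby
# Hypothetical helper: models clear(capacity: range) as "keep the current
# capacity, but never below range.begin and never above range.end".
def capacity_after_clear(current_capacity, allowed)
  current_capacity.clamp(allowed)
end

puts capacity_after_clear(500, 1024..8192)     # grown to the minimum: 1024
puts capacity_after_clear(4000, 1024..8192)    # kept as is: 4000
puts capacity_after_clear(20_000, 1024..8192)  # shrunk to the maximum: 8192
```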

Updated by dylants (Dylan Thacker-Smith) about 1 month ago

> Maybe String#capacity and String#capacity= would make sense?

Using capacity= for the method name would set the assumption that the capacity is exactly that after the call. However, with embedded strings, the capacity would be fixed until it grows larger than what can be embedded in the object struct. That's why I suggested shrink as the name to shrink the capacity.
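Whether a given string is embedded can be observed on CRuby through ObjectSpace.dump, as in the description above. A minimal sketch, assuming CRuby's dump format (the "embedded" key is implementation-specific) and a hypothetical helper name `embedded?`:

```ruby
require 'json'
require 'objspace'

# Hypothetical helper: true when CRuby reports the string as stored inside
# the object slot rather than in a separately allocated buffer.
def embedded?(str)
  JSON.parse(ObjectSpace.dump(str))["embedded"] == true
end

puts embedded?("x" * 3)
puts embedded?("x" * 2000)
```

The size threshold between the two representations depends on the Ruby version, which is part of why an exact capacity= is hard to promise.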

> But then there's the question of the behavior if you set the capacity to lower than the size. Should it truncate? (this could corrupt UTF-8 for instance) or should it raise?

I think that should raise, since it seems too implicit to have a call to set the capacity also truncate the contents.

I do think it would be useful to be able to efficiently truncate a string, but that could be done with a separate method. For example, String#size= could be provided and could efficiently truncate a binary string and would avoid corrupting UTF-8 strings.

There are limited String methods for working with byte offsets for variable width encoded strings like UTF-8, so I'm actually surprised that there is already a String#byteslice method. Nothing prevents that from creating an invalid UTF-8 string; however, I don't see the use case for using that with non-binary strings. I think a way to truncate using byte offset would be more useful as part of the C API for now.
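With the methods that exist today, a byte-limited truncation that does not leave a partial character behind can be approximated by combining String#byteslice with String#scrub. This is a sketch with a hypothetical helper name; it allocates new strings rather than truncating in place, and `scrub("")` would also drop any invalid bytes that were already in the input.

```ruby
# Hypothetical helper: returns at most `limit` bytes of `str`, dropping a
# trailing character that the cut would otherwise split in half.
def truncate_bytes(str, limit)
  str.byteslice(0, limit).scrub("")
end

text = "h\u00E9llo"            # "é" takes two bytes in UTF-8
p truncate_bytes(text, 2)      # the split "é" is dropped, leaving "h"
p truncate_bytes(text, 3)      # "hé" fits exactly
```

For a binary string every byte sequence is valid, so the same call is a plain byte truncation.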

> My feeling is handling the capacity in Ruby code feels wrong and like C++ code.

Performance sensitive code will naturally be written based on what is more efficient for the machine (the primary concern of C++), such as preferring mutations to avoid object allocations. Providing primitive low-level methods for performance sensitive ruby code will allow more pleasant optimization than forcing the code to be rewritten in a native extension to do the same optimization.

> String#resize

size refers to the size of the contents, so resize seems like it would affect that size (e.g. truncating or padding) instead of just the capacity.

> What about buffer.clear(capacity: 1024)
> Or maybe even buffer.clear(capacity: 1024..8192)
> I think that's more straightforward than separate clear and resize operations.

Coupling capacity control with clearing the buffer makes the capacity control less general. For instance, it doesn't support shrinking the buffer to fit the contents or growing the buffer once before multiple appends.

Updated by dsisnero (Dominic Sisneros) 25 days ago

That is what I was hoping the addition of memoryview would help with, but the only way to interact with a memoryview in Ruby is through Fiddle.
If we had a ByteArray class that implemented memoryview, it could look like this:

buffer = ByteArray.new('this is a string'.bytes)

mv = Fiddle::MemoryView.new(buffer)
mv.byte_size # 16
first8 = mv[0:8] # once Fiddle::MemoryView allows you to slice
socket.write(first8) # once socket.write allows you to write memoryview objects without changing into string.

What memoryview is supposed to do is allow the reading and writing with zero copy because it knows the offsets, strides, etc of the underlying obj in the buffer

So, I think we should instead finish the parts of memoryview that are missing:

1) IO support for memoryview (read into a memoryview object and write from a memoryview object without converting to strings)
2) Add classes that implement the memoryview protocol
3) change String.bytes to return a new ByteArray class that implements the memoryview protocol
4) add a ruby extension that allows you to use memory view objects in ruby (not just Fiddle)
mv = MemoryView.new(obj)
mv[offset:offset_size] #slicing memory views
mv.cast(format, shape) - change format or shape of memoryview but keep data
mv.format
mv.strides
mv.shape
