Project

General

Profile

Actions

Bug #14900

closed

Extra allocation in String#byteslice

Added by janko (Janko Marohnić) over 4 years ago. Updated 2 months ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Target version:
-
[ruby-core:87845]
Tags:

Description

When executing String#byteslice with a range, I noticed that sometimes the original string is allocated again. When I run the following script:

require "objspace"

string = "a" * 100_000

GC.start
GC.disable
generation = GC.count

ObjectSpace.trace_object_allocations do
  string.byteslice(50_000..-1)

  ObjectSpace.each_object(String) do |string|
    p string.bytesize if ObjectSpace.allocation_generation(string) == generation
  end
end

it outputs

50000
100000
6
5

The one with 50000 bytes is the result of String#byteslice, but the one with 100000 bytes is the duplicated original string. I expected only the result of String#byteslice to be amongst new allocations.

If instead of the last 50000 bytes I slice the first 50000 bytes, the extra duplication doesn't occur.

# ...
  string.byteslice(0, 50_000)
# ...
50000
5

It's definitely ok if the implementation of String#bytesize allocates extra strings as part of the implementation, but it would be nice if they were deallocated before returning the result.

EDIT: It seems that String#slice has the same issue.

Actions #1

Updated by janko (Janko Marohnić) over 4 years ago

  • Description updated (diff)

Updated by ioquatix (Samuel Williams) over 4 years ago

Nice catch I will try to verify on my end too

Updated by ioquatix (Samuel Williams) over 4 years ago

Okay, I reproduced the error. I made a test case here:

https://github.com/ioquatix/ruby/commit/9fb5cd644209efc79378841e1b6eb644876393b0

I test both prefix and postfix as you discuss in your initial report.

Updated by ioquatix (Samuel Williams) over 4 years ago

One thing I noticed if I freeze source string, the extra memory allocation goes away.

Updated by ioquatix (Samuel Williams) over 4 years ago

I think there are several things to consider here:

  • Even though the string appears to be two allocations, it's only one allocation but the 2nd one is sharing the first's data.
  • I guess that subsequent slice would share the underling frozen string?
  • In some cases, byteslice might be less efficient, e.g. 100Mbyte buffer, slice the last 10bytes, it makes an entire copy of the source string, but all you were interested in was 10 bytes at the end.

Updated by funny_falcon (Yura Sokolov) over 4 years ago

@ioquatix (Samuel Williams), your patch doesn't seems to be correct for me on first glance.

Imagine pipelined RPC server:

  • we read data into buffer
  • while buffer larger than request size
  • detect first request and split buffer into request and rest of buffer

Same for any other binary parser.

With current behavior, operation "get rest of buffer" will copy buffer into shared frozen string only once.
With your patch it will copy every time.
So instead on linear complexity we will have quadratic complexity.

Thinking second time, it is possible to use frozen string explicitely for buffer, just not so trivial (while there are not enough data for request, buffer should not frozen, and << should be used, otherwise it should be frozen, and + used).

Some programs will certainly become slower with this change, until they fixed.

I'm not against the patch, but new behavior should be carefully documented and mentioned in a Changelog as a change, that could negatively affect performance if not concerned.

Updated by ioquatix (Samuel Williams) over 4 years ago

Yeah, I agree, this patch probably isn't right, but I just try to figure it out what is going on and suggest a solution. The outcome may be that this is normal behaviour. Thanks for your feedback.

Updated by ioquatix (Samuel Williams) over 4 years ago

The way I've implemented it now (as in your first example) is something like this:

@buffer = read_data
if @buffer.bytesize > REQUEST_SIZE
    @buffer.freeze
    request_buffer = @buffer.byteslice(0, REQUEST_SIZE)
    @buffer = @buffer.byteslice(REQUEST_SIZE, @buffer.bytesize)
end

Because we will recreate @buffer from remainder, it makes sense to freeze the source to avoid generating a hidden copy. Does that make sense?

I believe if we were to propose an implementation of byteslice! it would look like the above.

Updated by ioquatix (Samuel Williams) over 4 years ago

I played around with my assumptions here. By far the worst from a memory POV was slice!, which given a string of 5MB, produces 7.5MB allocations. The equivalent sequence of byteslice as above only allocates 2.5MB.

Here were my comparisons:

measure_memory("Initial allocation") do
	string = "a" * 5*1024*1024
	string.freeze
end # => 5.0 MB

measure_memory("Byteslice from start to middle") do
	# Why does this need to allocate memory? Surely it can share the original allocation?
	x = string.byteslice(0, string.bytesize / 2)
end # => 2.5 MB

measure_memory("Byteslice from middle to end") do
	string.byteslice(string.bytesize / 2, string.bytesize)
end # => 0.0 MB

measure_memory("Slice! from start to middle") do
	string.dup.slice!(0, string.bytesize / 2) # dup doesn't make any difference to size of allocations
end # => 7.5 MB

measure_memory("Byte slice into two halves") do
	head = string.byteslice(0, string.bytesize / 2)
	remainder = string.byteslice(string.bytesize / 2, string.bytesize)
end # 2.5 MB

(examples are also here: https://github.com/socketry/async-io/blob/master/examples/allocations/byteslice.rb)

In the best case, the last example should be able to reuse the source string entirely, but Ruby doesn't seem capable of doing that yet. Perhaps a specific implementation of byteslice! could address this use case with zero allocations?

Actions #11

Updated by jeremyevans0 (Jeremy Evans) over 2 years ago

  • Tracker changed from Bug to Feature
  • ruby -v deleted (ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-darwin17])
  • Backport deleted (2.3: UNKNOWN, 2.4: UNKNOWN, 2.5: UNKNOWN)

Updated by ioquatix (Samuel Williams) 2 months ago

I updated my PR and started investigating the changes between the current implementation and the proposed implementation, and after discussing it with @byroot (Jean Boussier) we found a bug: https://github.com/ruby/ruby/pull/6443

This actually is far worse in terms of over-allocation, so fixing this is a good first step.

We are also considering introducing byteslice! which should be a better interface for when you "don't care about the source buffer" and thus can "internally freeze it to avoid memory allocations". https://bugs.ruby-lang.org/issues/13626

That being said, I'll update the PR here (impacting byteslice) and check whether it still makes sense.

Updated by ioquatix (Samuel Williams) 2 months ago

Okay, so @byroot (Jean Boussier) and I discussed this issue at length.

The simplest way to get the behaviour you want is to call freeze on the source string.

require "objspace"

string = "a" * 100_000

GC.start
GC.disable
generation = GC.count

string.freeze # ADD THIS LINE

ObjectSpace.trace_object_allocations do
  string.byteslice(50_000..-1)

  ObjectSpace.each_object(String) do |string|
    p string.bytesize if ObjectSpace.allocation_generation(string) == generation
  end
end

This has the desired result that the source string is not copied, but is internally used as a shared buffer for the string.byteslice. However, it prevents you from modifying the source string.

The reason why it works this way, is because as @funny_falcon (Yura Sokolov) pointed out, many people are slicing the front off a buffer several times. If the buffer is huge, this would result in many large memcpy.

By freezing the string you avoid this memcpy. Alternatively, byteslice does this for you internally. It's a bit more nuanced than that because smaller strings are always copied, so this only kicks in for larger strings.

So I wanted to think a bit about how to do this efficiently - I think something like this can be pretty good:

buffer = String.new # allocation
while true
	# Efficiently read into the buffer:
	if buffer.empty?
		io.read(1024, buffer)
	else
		buffer << io.read(1024)
	end

	# Consume the buffer in chunks:
	while size = consume(buffer)
		# Freeze the buffer so it will be shared during processing:
		buffer.freeze
		
		buffer = buffer.byteslice(size..-1) # shared root string - no memcpy or allocation
	end

	# Unfreeze the buffer if needed.
	buffer = +buffer
end

The proposed PR basically skips the internal sharing mechanism unless you call buffer.freeze. In current Ruby, it's optional, and if you don't freeze it, Ruby is forced to create an internal dup, which is what you are seeing.

We should investigate the performance of the typical IO usage, to see which way is better.

Updated by ioquatix (Samuel Williams) 2 months ago

By the way, ideally, I think you can implement this:

buffer = String.new # allocation
while true
	# Efficiently read into the buffer:
	if buffer.empty?
		io.read(1024, buffer)
	else
		buffer << io.read(1024)
	end

	# Consume the buffer in chunks:
	while size = consume(buffer)
		buffer.byteslice!(size..-1) # shared root string - no memcpy or allocation
	end
end

Updated by byroot (Jean Boussier) 2 months ago

  • Status changed from Open to Closed
  • Backport set to 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN
  • Tracker changed from Feature to Bug

I think if we stick solely to the ticket description, I think https://github.com/ruby/ruby/pull/6443 (2e88bca) fixes the issue (the extra useless allocation).

Also note that this extra allocation was a shared string, so it wasn't copying the string content.

So we can probably close this as fixed now.

If we wish to improve typical IO buffering code, we can open another issue.

Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0