Feature #17816
Closed
Move C heap allocations for RVALUE object data into GC heap
Description
Pull Request:
Introduction
This work supersedes PR #4107 and Redmine #17570. We've reimplemented the feature to make the diff smaller, easier to maintain, and less intrusive to existing data structures.
We're working at Shopify to restructure Ruby's memory management so that objects can occupy more than one heap slot. This will allow data that was previously allocated on the C heap to be stored next to its associated `RVALUE` slot in a contiguous memory region.
We believe that this will simplify the internals of the GC by:
- Removing the distinction between embedded and heap allocated objects as everything will now effectively be embedded across multiple slots.
- Allowing us to remove the transient heap. The transient heap reduces the number of `malloc` calls for heap allocated objects by deferring them until the object is promoted to an old object. When objects no longer need to call `malloc`, the transient heap can be removed.
We believe that there will be performance improvements across most Ruby codebases as a result of these simplifications. Objects will also have improved data locality, resulting in improved hardware cache performance.
Summary of changes
This is a rewrite of a feature initially proposed in PR #4107.
The referenced PR adds the core implementation and API needed to store arbitrary length data inside contiguous free slots on the heap. It also includes a reference implementation for `T_CLASS` objects, which would usually allocate the `rb_classext_t` struct on the system heap. The current API is:
- `RVARGC_NEWOBJ_OF` - A reimplementation of the `NEWOBJ_OF` macro that takes an additional parameter `payload_length`, the length of the payload data to store in bytes.
- `rb_rvargc_payload_data_ptr` - a `void *` to the start of the region where the extra data can be allocated.
We've introduced a new type `T_PAYLOAD` and a struct `RPayload` that contains a single `VALUE flags`. We use the `FL_USER` bits to store the number of payload slots so that we can stride over the payload body in most places where heap walking is required (as these slots can now contain user defined data, they will not have accurate `flags`, and so most type checks will be incorrect).
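The striding described above can be sketched in isolation. This is a minimal, self-contained model rather than the code from the PR: the slot union, the `TYPE_*` tag values, `LEN_SHIFT`, and every function name below are hypothetical stand-ins, and the sketch assumes the stored slot count includes the header slot itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uintptr_t VALUE;

/* All constants are illustrative stand-ins, not Ruby's real values. */
enum { SLOT_SIZE = 40 };                      /* one heap slot, RVALUE sized */
enum { TYPE_MASK = 0x1f, LEN_SHIFT = 12 };    /* type tag bits; user flag bits */
enum { TYPE_NONE = 0x00, TYPE_CLASS = 0x02, TYPE_PAYLOAD = 0x18 };

typedef union slot {
    VALUE flags;                              /* first word of a slot header */
    unsigned char bytes[SLOT_SIZE];           /* raw view, used by payload bodies */
} slot_t;

/* Build a payload header whose upper flag bits record the region length.
 * Assumption: the count covers the header slot plus the body slots. */
static VALUE payload_header(size_t nslots)
{
    assert(nslots >= 1);
    return (VALUE)TYPE_PAYLOAD | ((VALUE)nslots << LEN_SHIFT);
}

static size_t payload_slot_count(VALUE flags)
{
    return (size_t)(flags >> LEN_SHIFT);
}

/* Walk a page and count live objects. Payload bodies hold user data, so
 * their "flags" word is meaningless and the whole region is skipped. */
static size_t count_objects(const slot_t *page, size_t nslots)
{
    size_t count = 0;
    size_t i = 0;
    while (i < nslots) {
        VALUE flags = page[i].flags;
        if ((flags & TYPE_MASK) == TYPE_PAYLOAD) {
            size_t n = payload_slot_count(flags);
            i += n ? n : 1;                   /* stride over header and body */
            continue;
        }
        if ((flags & TYPE_MASK) != TYPE_NONE) count++;
        i++;
    }
    return count;
}

/* Demo page: class, payload header + 2 body slots of junk, free slot, class. */
static size_t demo_count_objects(void)
{
    slot_t page[6];
    memset(page, 0, sizeof page);
    page[0].flags = TYPE_CLASS;
    page[1].flags = payload_header(3);
    memset(page[2].bytes, 0xff, SLOT_SIZE);   /* user data that looks like flags */
    memset(page[3].bytes, 0xff, SLOT_SIZE);
    page[5].flags = TYPE_CLASS;
    return count_objects(page, 6);
}
```

In this model `demo_count_objects()` sees only the two class objects. Without the stride, the two junk-filled body slots would be misread as objects of some unrelated type.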
When `RVARGC_NEWOBJ_OF` is called with a payload size, we calculate the number of slots required to store the `RVALUE`, an `RPayload` and the payload data itself. We then search the ractor's `newobj_cache` for a region of the required size, remove the slots from the freelist and initialize them.
A pointer to the first allocatable byte in the payload body section can then be found using `rb_rvargc_payload_data_ptr`.
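The slot arithmetic for an allocation can be illustrated with a short sketch. It is not taken from the PR: `struct RPayloadSketch`, `payload_region_slots`, `total_slots` and `payload_data_ptr` are hypothetical names, and the layout (payload header in the slot directly after the `RVALUE`, body starting right after the header word) is an assumption based on the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t VALUE;

enum { SLOT_SIZE = 40 };                 /* assumed slot width: one RVALUE */

/* Hypothetical mirror of the single-field RPayload struct. */
struct RPayloadSketch { VALUE flags; };

/* Slots needed for the payload region: header plus body, rounded up. */
static size_t payload_region_slots(size_t payload_length)
{
    size_t bytes = sizeof(struct RPayloadSketch) + payload_length;
    return (bytes + SLOT_SIZE - 1) / SLOT_SIZE;
}

/* Slots for the whole allocation: one RVALUE plus the payload region. */
static size_t total_slots(size_t payload_length)
{
    assert(payload_length > 0);
    return 1 + payload_region_slots(payload_length);
}

/* First allocatable byte of the body. Assumes the payload header sits in
 * the slot immediately after the RVALUE slot. */
static void *payload_data_ptr(unsigned char *rvalue_slot)
{
    return rvalue_slot + SLOT_SIZE + sizeof(struct RPayloadSketch);
}
```

Under these assumptions, on a platform where `VALUE` is 8 bytes, a 100 byte payload needs 108 bytes including its header, which rounds up to 3 slots, so the allocation occupies 4 slots in total.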
These changes can be enabled using the compile time flag `USE_RVARGC=1`.
- We do not expect anyone to run production Ruby applications with this flag enabled. This is an experimental feature which we will improve incrementally.
- Should these experiments prove unsuccessful in the long term, we will completely remove this feature and all related code.
- This PR has no performance implications when `USE_RVARGC` is disabled. Allocation of `RVALUE`s in a single slot behaves almost identically to before this change (see Benchmarking data).
Features (and challenges)
- `T_PAYLOAD` is fully integrated with the existing GC. The entire payload region will be treated as one single slot for marking, sweeping and generational purposes. In contrast with our previous attempt, this means we no longer need to disable incremental marking, nor do we need to use an extra bitmap attached to a heap_page.
- All slots that are part of a `T_CLASS` and its payload region are pinned, so compaction will not move them. This has impacted the effectiveness of compaction, but unlike our previous PR, doesn't require us to disable compaction completely.
- RSS is significantly larger when `USE_RVARGC` is enabled. This is due to our (currently) naive approach to free region allocation.
Next steps
With this merged, we have several directions we intend to investigate:
- Performance benchmarking: Analysing L1, L2 and L3 cache performance to decide where best to introduce RVarGC first, and what (if any) performance gains we'll see by improving data locality. Our current speculative contenders are arrays, ivars and strings.
- Improvements to the way the payload data is managed: move the payload length into the RVALUE itself, and inline the payload body, removing the need for the `T_PAYLOAD` object entirely.
- Compaction improvements: Investigating which compaction algorithms perform better with objects of variable size.
- Resize payload regions. Currently we have no support for resizing payload regions. This must be fixed before we can support many of the different Ruby types.
- Free region allocation: Find a way of managing the freelist that performs better with allocations of contiguous regions than the current singly linked freelist approach.
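To illustrate why a singly linked freelist is an awkward fit for contiguous regions, here is a minimal sketch of a region search. It is not the PR's allocator: `free_slot_t`, `take_region` and the demo helper are hypothetical, and the sketch assumes the freelist is kept sorted by ascending address, which the description above does not state.

```c
#include <stddef.h>

enum { SLOT_SIZE = 40 };                 /* assumed slot width */

/* A free slot stores only its freelist link; padding keeps the slot size. */
typedef struct free_slot {
    struct free_slot *next;
    unsigned char pad[SLOT_SIZE - sizeof(struct free_slot *)];
} free_slot_t;

/* Find the first run of `want` address-adjacent slots in a freelist sorted
 * by ascending address, unlink the run, and return its first slot.
 * Returns NULL when no such run exists. Worst case walks the whole list. */
static free_slot_t *take_region(free_slot_t **head, size_t want)
{
    free_slot_t **run_link = head;       /* link pointing at the run's first slot */
    free_slot_t *run_start = *head;
    free_slot_t *cur = *head;
    size_t run_len = run_start ? 1 : 0;

    if (want == 0) return NULL;

    while (cur) {
        if (run_len == want) {
            *run_link = cur->next;       /* splice the whole run out of the list */
            return run_start;
        }
        free_slot_t *next = cur->next;
        if (next == cur + 1) {
            run_len++;                   /* next free slot is physically adjacent */
        } else {
            run_link = &cur->next;       /* gap: start a new candidate run */
            run_start = next;
            run_len = next ? 1 : 0;
        }
        cur = next;
    }
    return NULL;
}

/* Demo: slots 0, 2, 3, 4 and 6 are free; request 3 contiguous slots.
 * Returns the index of the region's first slot, or a negative error code. */
static int demo_take_region(void)
{
    static free_slot_t slots[8];
    size_t free_idx[] = {0, 2, 3, 4, 6};
    free_slot_t *head = NULL, **tail = &head;

    for (size_t i = 0; i < sizeof free_idx / sizeof free_idx[0]; i++) {
        *tail = &slots[free_idx[i]];
        tail = &slots[free_idx[i]].next;
    }
    *tail = NULL;

    free_slot_t *region = take_region(&head, 3);
    if (!region) return -1;
    /* The remaining freelist should be slot 0 followed by slot 6. */
    if (head != &slots[0] || head->next != &slots[6] || head->next->next != NULL)
        return -2;
    return (int)(region - slots);
}
```

Every request in this model is a linear scan, and a fragmented list can fail to produce a region even when plenty of individual slots are free. That is the kind of cost an alternative freelist structure would aim to avoid.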
The end game for this work is to remove the requirement for an `RVALUE` to be exactly 40 bytes wide. This is obviously a long game, of which this PR takes the first steps.
Benchmarking
We used Railsbench to compare the performance of master with our branch, with `USE_RVARGC=0`:
ubuntu@ip-172-31-42-217:~/railsbench$ chruby master
ubuntu@ip-172-31-42-217:~/railsbench$ setarch x86_64 -R nice -20 taskset -c 75 ./bin/bench
ruby 3.1.0dev (2021-04-19T12:40:29Z master 50f17241a3) [x86_64-linux]
{"cppflags"=>"-DUSE_RVARGC=0", "optflags"=>"-O3 -fno-fast-math"}
Warming up...
Benchmark: 10000 requests
Request per second: 747.3 [#/s] (mean)
Percentage of the requests served within a certain time (ms)
50% 1.32
66% 1.36
75% 1.38
80% 1.39
90% 1.42
95% 1.46
98% 1.53
99% 1.84
100% 11.40
ubuntu@ip-172-31-42-217:~/railsbench$ setarch x86_64 -R nice -20 taskset -c 75 ./bin/bench
ruby 3.1.0dev (2021-04-20T10:02:39Z mvh-rvargc 2045bfb7f7) [x86_64-linux]
{"cppflags"=>"-DUSE_RVARGC=0", "optflags"=>"-O3 -fno-fast-math"}
Warming up...
Benchmark: 10000 requests
Request per second: 746.3 [#/s] (mean)
Percentage of the requests served within a certain time (ms)
50% 1.31
66% 1.37
75% 1.39
80% 1.39
90% 1.41
95% 1.44
98% 1.51
99% 1.83
100% 8.97
And the same comparison using Optcarrot:
ubuntu@ip-172-31-42-217:~/optcarrot$ chruby master
ubuntu@ip-172-31-42-217:~/optcarrot$ ./bin/optcarrot --benchmark examples/Lan_Master.nes
fps: 43.62907118228718
checksum: 59662
ubuntu@ip-172-31-42-217:~/optcarrot$ chruby rvargc
ubuntu@ip-172-31-42-217:~/optcarrot$ ./bin/optcarrot --benchmark examples/Lan_Master.nes
fps: 43.90831352849611
checksum: 59662