Feature #17816

closed

Move C heap allocations for RVALUE object data into GC heap

Added by eightbitraptor (Matt V-H) over 3 years ago. Updated over 3 years ago.

Status:
Closed
Assignee:
-
Target version:
-
[ruby-core:103520]

Description

Pull Request:

GitHub PR: 4391

Introduction

This work supersedes the work in PR: 4107 and Redmine: 17570. We've reimplemented the feature to make the diff smaller, easier to maintain and less intrusive to existing data structures.

We're working at Shopify to restructure Ruby memory management in order to allow objects to occupy more than one heap slot. This will allow previously heap allocated data to be stored next to its associated RVALUE slot in a contiguous memory region.

We believe that this will simplify the internals of the GC by:

  • Removing the distinction between embedded and heap allocated objects as everything will now effectively be embedded across multiple slots.
  • Allowing us to remove the transient heap. The transient heap reduces the number of malloc calls for heap allocated objects by deferring them until the object is promoted to an old object. When objects no longer need to call malloc, the transient heap can be removed.

We believe that there will be performance improvements across most Ruby codebases as a result of these simplifications. Objects will also have improved data locality, resulting in improved hardware cache performance.

Summary of changes

This is a rewrite of a feature initially proposed in PR #4107.

The referenced PR adds the core implementation and API for storing arbitrary-length data inside contiguous free slots on the heap. It also includes a reference implementation for T_CLASS objects, which would usually allocate the rb_classext_t struct on the system heap. The current API is:

  • RVARGC_NEWOBJ_OF - A reimplementation of the NEWOBJ_OF macro that takes an additional parameter payload_length, the length of the payload data to store in bytes.
  • rb_rvargc_payload_data_ptr - returns a void * pointing to the start of the region where the extra data can be stored.

We've introduced a new type, T_PAYLOAD, and a struct, RPayload, that contains a single VALUE flags field. We use the FL_USER bits to store the number of payload slots so that we can stride over the payload body in most places where heap walking is required. This is necessary because these slots can now contain user-defined data: they will not have accurate flags, so most type checks on them would be incorrect.
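As a rough illustration of the striding described above, the following self-contained C sketch packs a slot count into the user bits of a flags word and skips payload bodies while walking a page. The sketch_ names, the type tag values, the bit position and width, and the assumption that the stored count includes the header slot are all hypothetical; none of them are taken from gc.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration only; not the values used in gc.c. */
enum { SKETCH_T_OBJECT = 0x01, SKETCH_T_PAYLOAD = 0x17 };
#define SKETCH_TYPE_MASK  0x1fu  /* low bits hold the type tag */
#define SKETCH_USER_SHIFT 12     /* assumed position of the user bits */
#define SKETCH_USER_BITS  19     /* assumed width of the user bits */

/* One heap slot: a flags word plus padding (40 bytes on a 64-bit build). */
typedef struct { uintptr_t flags; uintptr_t pad[4]; } sketch_slot;

/* Build a payload header flags word. Assumption: `slots` counts the
 * header slot itself plus every body slot that follows it. */
uintptr_t sketch_payload_flags(size_t slots) {
    assert(slots > 0 && slots < ((size_t)1 << SKETCH_USER_BITS));
    return (uintptr_t)SKETCH_T_PAYLOAD | ((uintptr_t)slots << SKETCH_USER_SHIFT);
}

/* Recover the slot count stored in a payload header. */
size_t sketch_payload_slots(uintptr_t flags) {
    return (size_t)(flags >> SKETCH_USER_SHIFT);
}

/* Walk a page and count ordinary objects. Payload regions are skipped as a
 * unit, so the raw bytes in their bodies are never read as flags. */
size_t sketch_count_objects(const sketch_slot *page, size_t n) {
    size_t count = 0;
    size_t i = 0;
    while (i < n) {
        if ((page[i].flags & SKETCH_TYPE_MASK) == SKETCH_T_PAYLOAD) {
            size_t len = sketch_payload_slots(page[i].flags);
            i += len ? len : 1;  /* guard against a zero stride */
        } else {
            count++;
            i++;
        }
    }
    return count;
}
```

The point of the stride is that a body slot whose first word happens to look like a valid type tag is never inspected, which is why the walk must trust the header's length rather than per-slot type checks.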

When RVARGC_NEWOBJ_OF is called with a payload size, we calculate the number of slots required to store the RVALUE, an RPayload and the payload data itself. We then search the ractor's newobj_cache for a region of the required size, remove the slots from the freelist and initialize them.

Then a pointer to the first allocatable byte in the payload body section can be found using rb_rvargc_payload_data_ptr.
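The arithmetic behind these two steps can be sketched as follows. This is a minimal illustration, assuming the payload header sits directly after the object's own slot and that slots are 40 bytes wide (the RVALUE width mentioned later in this issue). The sketch_ names are hypothetical and this is not the code from the PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SLOT_SIZE 40  /* RVALUE width stated in this issue */

/* Stand-in for RPayload: a single flags word, as described above. */
typedef struct { uintptr_t flags; } sketch_rpayload;

/* Slots needed for the object itself plus the payload header and body.
 * The header and body share slots, rounded up to a whole slot. */
size_t sketch_slots_required(size_t payload_length) {
    size_t bytes = sizeof(sketch_rpayload) + payload_length;
    return 1 + (bytes + SKETCH_SLOT_SIZE - 1) / SKETCH_SLOT_SIZE;
}

/* First writable payload byte, given the address of the object's own slot.
 * Assumption: the payload header occupies the start of the next slot. */
void *sketch_payload_data_ptr(void *obj_slot) {
    assert(obj_slot != NULL);
    return (unsigned char *)obj_slot + SKETCH_SLOT_SIZE + sizeof(sketch_rpayload);
}
```

Under these assumptions, a payload of up to 40 minus the header size bytes fits in one extra slot, and each further 40 bytes (or part thereof) costs one more slot.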

These changes can be enabled using the compile time flag USE_RVARGC=1.

  • We do not expect anyone to run production Ruby applications with this flag enabled. This is an experimental feature which we will improve incrementally.
  • Should these experiments prove unsuccessful in the long term, we will completely remove this feature and all related code.
  • This PR has no performance implications when USE_RVARGC is disabled. Allocation of RVALUEs in a single slot behaves almost identically to before this change (see the Benchmarking data).

Features (and challenges)

  • T_PAYLOAD is fully integrated with the existing GC. The entire payload region will be treated as one single slot for marking, sweeping and generational purposes. In contrast with our previous attempt, this means we no longer need to disable incremental marking, nor do we need to use an extra bitmap attached to a heap_page.
  • All slots that are part of a T_CLASS and its payload region are pinned, so compaction will not move them. This has impacted the effectiveness of compaction, but unlike our previous PR, doesn't require us to disable compaction completely.
  • RSS is significantly larger when USE_RVARGC is enabled. This is due to our (currently) naive approach to free region allocation.

Next steps

With this merged, we have several different directions we intend to investigate:

  • Performance benchmarking: Analysing L1, L2 and L3 cache performance to decide where best to introduce RVarGC first, and what (if any) performance gains we'll see by improving data locality. Our current speculative contenders are arrays, ivars and strings.
  • Improvements to the way the Payload data is managed: move the payload length into the RVALUE itself, and inline the payload body, removing the need for the T_PAYLOAD object entirely.
  • Compaction improvements: Investigating which compaction algorithms perform better with objects of variable size.
  • Resizing payload regions: Currently we have no support for resizing payload regions. This must be fixed before we can support many of the different Ruby types.
  • Free region allocation: Find a way of managing the freelist that performs better with allocations of contiguous regions than the current singly linked freelist approach.
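To illustrate why contiguous-region allocation is awkward with a singly linked freelist, here is a self-contained C sketch that walks such a list looking for a run of adjacent slots and unlinks it. It assumes the freelist is linked in ascending address order, which may not match the real freelist, and the sketch_ names are hypothetical. It is not the search used in the PR.

```c
#include <assert.h>
#include <stddef.h>

/* A free slot: a next pointer padded out to the 40-byte slot width
 * (on a 64-bit build). */
typedef struct sketch_free_slot {
    struct sketch_free_slot *next;
    unsigned char pad[40 - sizeof(void *)];
} sketch_free_slot;

/* Find the first run of `want` address-adjacent slots, unlink the whole run
 * from the list and return its first slot, or NULL if no run is long enough.
 * Assumption: the list is sorted by ascending address. */
sketch_free_slot *sketch_take_region(sketch_free_slot **freelist, size_t want) {
    sketch_free_slot **link = freelist;  /* the pointer that leads to a run start */
    assert(want > 0);
    while (*link) {
        sketch_free_slot *start = *link;
        sketch_free_slot *end = start;
        size_t run = 1;
        /* Extend the run while the next free slot is physically adjacent. */
        while (run < want && end->next == end + 1) {
            end = end->next;
            run++;
        }
        if (run == want) {
            *link = end->next;  /* splice the run out of the list */
            return start;
        }
        link = &end->next;      /* run was too short; resume after the gap */
    }
    return NULL;
}
```

Even in this simplified form, every allocation is a linear scan of the freelist, and a fragmented heap can force the scan to visit every free slot before giving up, which is the cost a better free-region structure would need to avoid.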

The end game for this work is to remove the requirement for an RVALUE to be exactly 40 bytes wide. This is obviously a long game, of which this PR takes the first steps.

Benchmarking

We used Railsbench to compare the performance of master with our branch, with USE_RVARGC=0:

ubuntu@ip-172-31-42-217:~/railsbench$ chruby master
ubuntu@ip-172-31-42-217:~/railsbench$ setarch x86_64 -R nice -20 taskset -c 75 ./bin/bench
ruby 3.1.0dev (2021-04-19T12:40:29Z master 50f17241a3) [x86_64-linux]
{"cppflags"=>"-DUSE_RVARGC=0", "optflags"=>"-O3 -fno-fast-math"}
Warming up...
Benchmark: 10000 requests

Request per second: 747.3 [#/s] (mean)

Percentage of the requests served within a certain time (ms)
  50%    1.32
  66%    1.36
  75%    1.38
  80%    1.39
  90%    1.42
  95%    1.46
  98%    1.53
  99%    1.84
 100%   11.40
ubuntu@ip-172-31-42-217:~/railsbench$ setarch x86_64 -R nice -20 taskset -c 75 ./bin/bench
ruby 3.1.0dev (2021-04-20T10:02:39Z mvh-rvargc 2045bfb7f7) [x86_64-linux]
{"cppflags"=>"-DUSE_RVARGC=0", "optflags"=>"-O3 -fno-fast-math"}
Warming up...
Benchmark: 10000 requests

Request per second: 746.3 [#/s] (mean)

Percentage of the requests served within a certain time (ms)
  50%    1.31
  66%    1.37
  75%    1.39
  80%    1.39
  90%    1.41
  95%    1.44
  98%    1.51
  99%    1.83
 100%    8.97

And the same comparison using Optcarrot:

ubuntu@ip-172-31-42-217:~/optcarrot$ chruby master
ubuntu@ip-172-31-42-217:~/optcarrot$ ./bin/optcarrot --benchmark examples/Lan_Master.nes
fps: 43.62907118228718
checksum: 59662
ubuntu@ip-172-31-42-217:~/optcarrot$ chruby rvargc
ubuntu@ip-172-31-42-217:~/optcarrot$ ./bin/optcarrot --benchmark examples/Lan_Master.nes
fps: 43.90831352849611
checksum: 59662

Related issues 1 (0 open, 1 closed)

Related to Ruby master - Feature #18045: Variable Width Allocation Phase II (Closed, peterzhu2118 (Peter Zhu))