Feature #18634
closedVariable Width Allocation: Arrays
Description
GitHub PR: https://github.com/ruby/ruby/pull/5660 ¶
Feature description¶
This patch changes arrays to allocate through Variable Width Allocation.
Similar to strings (implemented in ticket #18239), arrays allocated through Variable Width Allocation are embedded, meaning the contents of the array directly follow the array object headers.
When an array is resized, we fallback to allocating memory through the malloc heap. If the array was initially allocated in a larger slot, it would result in wastage of memory. However, in the benchmarks below, we can see that this wastage does not cause memory usage to increase significantly.
What's next¶
We're working on implementing cross size pool compaction for Variable Width Allocation. This will allow us to both downsize objects (to save memory) and upsize objects (to improve cache performance).
We're going to continue on implementing more types on Variable Width Allocation, such as Objects, Hashes, and ISeqs.
Benchmark setup¶
Benchmarking was done on a bare-metal Ubuntu machine on AWS. All benchmark results are using glibc by default, except when jemalloc is explicitly specified.
$ uname -a
Linux 5.13.0-1014-aws #15~20.04.1-Ubuntu SMP Thu Feb 10 17:55:03 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
glibc version:
$ ldd --version
ldd (Ubuntu GLIBC 2.31-0ubuntu9.2) 2.31
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.
jemalloc version:
$ apt list --installed | grep jemalloc
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
libjemalloc-dev/focal,now 5.2.1-1ubuntu1 amd64 [installed]
libjemalloc2/focal,now 5.2.1-1ubuntu1 amd64 [installed,automatic]
To measure memory usage over time, the mstat tool was used.
master was benchmarked on commit bec492c77e. The branch was rebased on top of the same commit.
railsbench¶
For railsbench, we ran the railsbench benchmark. For both the performance and memory benchmarks, 25 runs were conducted for each combination (branch + glibc, master + glibc, branch + jemalloc, master + jemalloc).
For both glibc and jemalloc allocators, there is not a significant change in RPS, response times, or max memory usage. We can see in the RSS over time graph that the memory behavior of the branch and master is very similar.
glibc¶
+-----------------------+--------+--------+-------------+
|                       | Branch | master | Improvement |
+-----------------------+--------+--------+-------------+
| RPS                   | 810.38 | 809.50 | 1.00x       |
| p50 (ms)              | 1.20   | 1.20   | 1.00x       |
| p90 (ms)              | 1.32   | 1.32   | 1.00x       |
| p99 (ms)              | 1.75   | 1.72   | 0.98x       |
| p100 (ms)             | 5.53   | 6.02   | 1.09x       |
| Max memory usage (MB) | 90.19  | 90.45  | 1.00x       |
+-----------------------+--------+--------+-------------+

jemalloc¶
+-----------------------+--------+--------+-------------+
|                       | Branch | master | Improvement |
+-----------------------+--------+--------+-------------+
| RPS                   | 834.04 | 840.81 | 0.99x       |
| p50 (ms)              | 1.18   | 1.17   | 0.99x       |
| p90 (ms)              | 1.27   | 1.26   | 0.99x       |
| p99 (ms)              | 1.69   | 1.65   | 0.98x       |
| p100 (ms)             | 5.54   | 7.03   | 1.27x       |
| Max memory usage (MB) | 88.50  | 87.48  | 0.99x       |
+-----------------------+--------+--------+-------------+

discourse¶
Discourse was benchmarked through the script/bench.rb benchmarking script. The response times for the home endpoint and RSS memory usage is shown below.
We see a slight increase in memory usage (5%) with glibc and an insignificant memory usage increase with jemalloc. We don't see big differences in response times.
glibc¶
+-----------+--------+--------+-------------+
|           | Branch | master | Improvement |
+-----------+--------+--------+-------------+
| p50 (ms)  | 75     | 76     | 1.01x       |
| p90 (ms)  | 88     | 90     | 1.02x       |
| p99 (ms)  | 248    | 261    | 1.05x       |
| RSS (MB)  | 364.48 | 383.80 | 1.05x       |
+-----------+--------+--------+-------------+
jemalloc¶
+-----------+--------+--------+-------------+
|           | Branch | master | Improvement |
+-----------+--------+--------+-------------+
| p50 (ms)  | 73     | 73     | 1.00x       |
| p90 (ms)  | 84     | 86     | 1.02x       |
| p99 (ms)  | 241    | 242    | 1.00x       |
| RSS (MB)  | 347.56 | 349.86 | 1.01x       |
+-----------+--------+--------+-------------+
rdoc generation¶
In rdoc generation, we see a small improvement in performance in glibc and no change in performance for jemalloc. We see a small max memory usage increase for both glibc and jemalloc. Howevver, the RSS over time graph shows that except for the very end, the branch actually has lower memory usage than master.
glibc¶
+-----------------------+--------+--------+-------------+
|                       | Branch | master | Improvement |
+-----------------------+--------+--------+-------------+
| Time (s)              | 17.81  | 18.11  | 1.02x       |
| Max memory usage (MB) | 287.74 | 283.24 | 0.98x       |
+-----------------------+--------+--------+-------------+

jemalloc¶
+-----------------------+--------+--------+-------------+
|                       | Branch | master | Improvement |
+-----------------------+--------+--------+-------------+
| Time (s)              | 17.59  | 17.46  | 0.99x       |
| Max memory usage (MB) | 289.92 | 277.30 | 0.96x       |
+-----------------------+--------+--------+-------------+

optcarrot¶
We don't see a change in performance in optcarrot.
+------+--------+--------+-------------+
|      | Branch | master | Improvement |
+------+--------+--------+-------------+
| FPS  | 43.10  | 43.25  | 1.00x       |
+------+--------+--------+-------------+
Liquid benchmarks¶
We don't see a big change in performance in liquid benchmarks.
+----------------------+--------+--------+-------------+
|                      | Branch | master | Improvement |
+----------------------+--------+--------+-------------+
| Parse (i/s)          | 39.57  | 40.43  | 0.98x       |
| Render (i/s)         | 129.78 | 130.22 | 1.00x       |
| Parse & Render (i/s) | 28.43  | 28.89  | 0.98x       |
+----------------------+--------+--------+-------------+
Microbenchmarks¶
These microbenchmarks are very favourable for VWA since the arrays created have a length of 10, so they are embedded in VWA and allocated on the malloc heap for master.
+-------------+--------+--------+-------------+
|             | Branch | master | Improvement |
+-------------+--------+--------+-------------+
| Array#first | 2.282k | 2.014k | 1.13x       |
| Array#last  | 2.095k | 2.092k | 1.00x       |
| Array#[0]=  | 2.232k | 2.079k | 1.07x       |
| Array#[-1]= | 2.181k | 2.064k | 1.06x       |
| Array#each  | 319.92 | 314.22 | 1.02x       |
+-------------+--------+--------+-------------+
        
           Updated by Eregon (Benoit Daloze) over 3 years ago
          Updated by Eregon (Benoit Daloze) over 3 years ago
          
          
        
        
      
      Improvement is Branch/master?
It seems inconsistent for these 2 example lines:
| p100 (ms)             | 5.53   | 6.02   | 0.92x       |
| p100 (ms)             | 5.54   | 7.03   | 1.27x       |
i.e. the Branch takes less time for both, but once improvement is <1x and once >1x.
        
           Updated by peterzhu2118 (Peter Zhu) over 3 years ago
          Updated by peterzhu2118 (Peter Zhu) over 3 years ago
          
          
        
        
      
      - Description updated (diff)
        
           Updated by peterzhu2118 (Peter Zhu) over 3 years ago
          Updated by peterzhu2118 (Peter Zhu) over 3 years ago
          
          
        
        
      
      Thanks for catching that. It was a math error.
Improvement for numbers where higher is better (e.g. RPS, IPS) is calculated as branch/master.
Improvement for numbers where lower is better (e.g. response times) is calculated as master/branch.
In other words, we want to target improvement >= 1 to not make performance worse.
        
           Updated by peterzhu2118 (Peter Zhu) over 3 years ago
          Updated by peterzhu2118 (Peter Zhu) over 3 years ago
          
          
        
        
      
      - Status changed from Open to Closed
Applied in changeset git|a51f30c6712798fc07e57f692d0d0e5ccc59acf1.
[Feature #18634] Implement Arrays on Variable Width Allocation
This commit implements arrays on Variable Width Allocation. This allows
longer arrays to be embedded (i.e. contents directly follow the object
header) which improves performance through better cache locality.