Feature #18634

Variable Width Allocation: Arrays

Added by peterzhu2118 (Peter Zhu) 4 months ago. Updated 3 months ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Target version:
-
[ruby-core:107910]

Description

GitHub PR: https://github.com/ruby/ruby/pull/5660

Feature description

This patch changes arrays to allocate through Variable Width Allocation.

Similar to strings (implemented in ticket #18239), arrays allocated through Variable Width Allocation are embedded, meaning the contents of the array directly follow the object header.

When an array is resized, we fall back to allocating memory through the malloc heap. If the array was initially allocated in a larger slot, the unused portion of the slot is wasted. However, the benchmarks below show that this wastage does not significantly increase memory usage.
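A rough way to observe this from Ruby is `ObjectSpace.memsize_of`, which reports an array's size and jumps once the contents move off the slot and onto the malloc heap. This is only an illustration, not part of the patch; the exact transition lengths and sizes depend on the Ruby version, build, and slot size the array landed in.

```ruby
require "objspace"

# Grow an array and print the reported size whenever it changes.
# Small arrays are embedded (contents live in the slot); once the array
# outgrows the slot, the contents are reallocated on the malloc heap
# and the reported memsize jumps.
ary = []
prev = ObjectSpace.memsize_of(ary)
64.times do |i|
  ary << i
  size = ObjectSpace.memsize_of(ary)
  puts "length=#{ary.length} memsize=#{size}" if size != prev
  prev = size
end
```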

What's next

We're working on implementing cross size pool compaction for Variable Width Allocation. This will allow us to both downsize objects (to save memory) and upsize objects (to improve cache performance).

We're going to continue implementing more types on Variable Width Allocation, such as Objects, Hashes, and ISeqs.

Benchmark setup

Benchmarking was done on a bare-metal Ubuntu machine on AWS. All benchmark results are using glibc by default, except when jemalloc is explicitly specified.

$ uname -a
Linux 5.13.0-1014-aws #15~20.04.1-Ubuntu SMP Thu Feb 10 17:55:03 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

glibc version:

$ ldd --version
ldd (Ubuntu GLIBC 2.31-0ubuntu9.2) 2.31
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.

jemalloc version:

$ apt list --installed | grep jemalloc

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

libjemalloc-dev/focal,now 5.2.1-1ubuntu1 amd64 [installed]
libjemalloc2/focal,now 5.2.1-1ubuntu1 amd64 [installed,automatic]

To measure memory usage over time, the mstat tool was used.

master was benchmarked on commit bec492c77e. The branch was rebased on top of the same commit.

railsbench

For railsbench, 25 runs were conducted for each combination (branch + glibc, master + glibc, branch + jemalloc, master + jemalloc), for both the performance and memory benchmarks.

For both glibc and jemalloc allocators, there is not a significant change in RPS, response times, or max memory usage. We can see in the RSS over time graph that the memory behavior of the branch and master is very similar.

glibc

+-----------------------+--------+--------+-------------+
|                       | Branch | master | Improvement |
+-----------------------+--------+--------+-------------+
| RPS                   | 810.38 | 809.50 | 1.00x       |
| p50 (ms)              | 1.20   | 1.20   | 1.00x       |
| p90 (ms)              | 1.32   | 1.32   | 1.00x       |
| p99 (ms)              | 1.75   | 1.72   | 0.98x       |
| p100 (ms)             | 5.53   | 6.02   | 1.09x       |
| Max memory usage (MB) | 90.19  | 90.45  | 1.00x       |
+-----------------------+--------+--------+-------------+

jemalloc

+-----------------------+--------+--------+-------------+
|                       | Branch | master | Improvement |
+-----------------------+--------+--------+-------------+
| RPS                   | 834.04 | 840.81 | 0.99x       |
| p50 (ms)              | 1.18   | 1.17   | 0.99x       |
| p90 (ms)              | 1.27   | 1.26   | 0.99x       |
| p99 (ms)              | 1.69   | 1.65   | 0.98x       |
| p100 (ms)             | 5.54   | 7.03   | 1.27x       |
| Max memory usage (MB) | 88.50  | 87.48  | 0.99x       |
+-----------------------+--------+--------+-------------+

discourse

Discourse was benchmarked through the script/bench.rb benchmarking script. The response times for the home endpoint and RSS memory usage are shown below.

We see a slight decrease in memory usage (5%) with glibc and an insignificant change in memory usage with jemalloc. We don't see big differences in response times.

glibc

+-----------+--------+--------+-------------+
|           | Branch | master | Improvement |
+-----------+--------+--------+-------------+
| p50 (ms)  | 75     | 76     | 1.01x       |
| p90 (ms)  | 88     | 90     | 1.02x       |
| p99 (ms)  | 248    | 261    | 1.05x       |
| RSS (MB)  | 364.48 | 383.80 | 1.05x       |
+-----------+--------+--------+-------------+

jemalloc

+-----------+--------+--------+-------------+
|           | Branch | master | Improvement |
+-----------+--------+--------+-------------+
| p50 (ms)  | 73     | 73     | 1.00x       |
| p90 (ms)  | 84     | 86     | 1.02x       |
| p99 (ms)  | 241    | 242    | 1.00x       |
| RSS (MB)  | 347.56 | 349.86 | 1.01x       |
+-----------+--------+--------+-------------+

rdoc generation

In rdoc generation, we see a small improvement in performance with glibc and no change in performance with jemalloc. We see a small increase in max memory usage for both glibc and jemalloc. However, the RSS over time graph shows that, except at the very end, the branch actually has lower memory usage than master.

glibc

+-----------------------+--------+--------+-------------+
|                       | Branch | master | Improvement |
+-----------------------+--------+--------+-------------+
| Time (s)              | 17.81  | 18.11  | 1.02x       |
| Max memory usage (MB) | 287.74 | 283.24 | 0.98x       |
+-----------------------+--------+--------+-------------+

jemalloc

+-----------------------+--------+--------+-------------+
|                       | Branch | master | Improvement |
+-----------------------+--------+--------+-------------+
| Time (s)              | 17.59  | 17.46  | 0.99x       |
| Max memory usage (MB) | 289.92 | 277.30 | 0.96x       |
+-----------------------+--------+--------+-------------+

optcarrot

We don't see a change in performance in optcarrot.

+------+--------+--------+-------------+
|      | Branch | master | Improvement |
+------+--------+--------+-------------+
| FPS  | 43.10  | 43.25  | 1.00x       |
+------+--------+--------+-------------+

Liquid benchmarks

We don't see a big change in performance in liquid benchmarks.

+----------------------+--------+--------+-------------+
|                      | Branch | master | Improvement |
+----------------------+--------+--------+-------------+
| Parse (i/s)          | 39.57  | 40.43  | 0.98x       |
| Render (i/s)         | 129.78 | 130.22 | 1.00x       |
| Parse & Render (i/s) | 28.43  | 28.89  | 0.98x       |
+----------------------+--------+--------+-------------+

Microbenchmarks

These microbenchmarks are very favourable for VWA since the arrays created have a length of 10, so they are embedded under VWA but allocated on the malloc heap on master.

+-------------+--------+--------+-------------+
|             | Branch | master | Improvement |
+-------------+--------+--------+-------------+
| Array#first | 2.282k | 2.014k | 1.13x       |
| Array#last  | 2.095k | 2.092k | 1.00x       |
| Array#[0]=  | 2.232k | 2.079k | 1.07x       |
| Array#[-1]= | 2.181k | 2.064k | 1.06x       |
| Array#each  | 319.92 | 314.22 | 1.02x       |
+-------------+--------+--------+-------------+
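For reference, these microbenchmarks follow roughly the shape below (the actual scripts are linked in the next section; the iteration count here is arbitrary and only illustrative): allocate a length-10 array each iteration and call the method under test, reporting thousands of iterations per second as in the table above.

```ruby
require "benchmark"

# Illustrative sketch of the Array#first microbenchmark: each iteration
# allocates a fresh length-10 array (embedded under VWA, malloc-backed
# on master) and calls the method under test.
N = 200_000
elapsed = Benchmark.realtime do
  N.times { Array.new(10) { |i| i }.first }
end
puts format("Array#first: %.0fk i/s", N / elapsed / 1000.0)
```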

Benchmark source code

Updated by Eregon (Benoit Daloze) 3 months ago

Improvement is Branch/master?
It seems inconsistent for these 2 example lines:

| p100 (ms)             | 5.53   | 6.02   | 0.92x       |
| p100 (ms)             | 5.54   | 7.03   | 1.27x       |

i.e. the Branch takes less time for both, but once improvement is <1x and once >1x.

Actions #2

Updated by peterzhu2118 (Peter Zhu) 3 months ago

  • Description updated (diff)

Updated by peterzhu2118 (Peter Zhu) 3 months ago

Thanks for catching that. It was a math error.

Improvement for numbers where higher is better (e.g. RPS, IPS) is calculated as branch/master.

Improvement for numbers where lower is better (e.g. response times) is calculated as master/branch.

In other words, we want to target improvement >= 1 to not make performance worse.
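The two conventions above can be expressed as a small illustrative helper (not part of any benchmark script), checked here against numbers from the railsbench tables:

```ruby
# Improvement ratio as defined above: branch/master when higher is better
# (RPS, IPS), master/branch when lower is better (response times, memory).
# A ratio >= 1.0 means the branch is no worse than master.
def improvement(branch, master, higher_is_better:)
  higher_is_better ? branch.fdiv(master) : master.fdiv(branch)
end

# railsbench glibc RPS (higher is better): 810.38 vs 809.50
puts improvement(810.38, 809.50, higher_is_better: true).round(2)  # => 1.0
# railsbench jemalloc p100 (lower is better): 5.54 vs 7.03
puts improvement(5.54, 7.03, higher_is_better: false).round(2)     # => 1.27
```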

Actions #4

Updated by peterzhu2118 (Peter Zhu) 3 months ago

  • Status changed from Open to Closed

Applied in changeset git|a51f30c6712798fc07e57f692d0d0e5ccc59acf1.


[Feature #18634] Implement Arrays on Variable Width Allocation

This commit implements arrays on Variable Width Allocation. This allows
longer arrays to be embedded (i.e. contents directly follow the object
header) which improves performance through better cache locality.
