Project

General

Profile

Actions

Feature #21846

closed

Add a fast path for GC sweeping

Feature #21846: Add a fast path for GC sweeping
2

Added by eightbitraptor (Matt V-H) about 2 months ago. Updated about 1 month ago.

Status:
Closed
Assignee:
-
Target version:
-
[ruby-core:124599]

Description

Github PR 15885

Summary

This proposal adds a fast path through the garbage collector's sweep phase that skips expensive cleanup operations for objects that don't require them. Simple embedded objects without external resources, finalizers, or weak references are added directly to the freelist, resulting in a 20-40% reduction in sweep time and 1-2% improvement in overall application performance.

Background

Ruby's garbage collector currently uses a single code path for sweeping all objects. During the sweep phase, every unmarked object goes through rb_gc_obj_free_vm_weak_references to clean up entries in the object_id table (id2ref_tbl) and generic fields table, followed by rb_gc_obj_free for type-specific cleanup and memory deallocation. This is necessary for objects with external resources, but a lot of objects created are "simple" - they have no external allocations, no registered object_id, no generic ivars, and no finalizers.

For these simple objects, the sweep phase does significant work even though there's nothing to clean up. Every object is checked against the id2ref_tbl table (even if ObjectSpace._id2ref was never called), they're checked for finalizers, and they go through type-specific free functions that often just return immediately. In applications that allocate many temporary objects, this overhead accumulates significantly.

Proposal

A new predicate function gc_sweep_fast_path_p determines whether an object can skip the expensive cleanup calls. Objects eligible for the fast path are added directly to the freelist after clearing the write barrier unprotected bit.

Two conditions always require the slow path:

  1. The object has a finalizer (FL_FINALIZE flag is set)
  2. The object has an observed object_id stored in the id2ref_tbl (checked via rb_shape_has_object_id)

If neither condition applies, we move on to some type specific flag checks

We use the fast path in these cases:

  • T_OBJECT: Embedded objects only (no heap-allocated instance variable array)
  • T_STRING: Embedded strings that aren't frozen
  • T_ARRAY: Embedded arrays
  • T_HASH: Embedded hashes (ie. no external st_table)
  • T_BIGNUM: Embedded bignums
  • T_STRUCT: Embedded structs
  • T_FLOAT, T_RATIONAL, T_COMPLEX
  • T_DATA: TypedData that is embedded with a NULL or RUBY_DEFAULT_FREE dfree function
  • T_IMEMO: Safe subtypes only (constcache, cref, ifunc, memo, svar, throw_data)

For all the above types we'll also check whether any generic ivars are registered in the generic_fields_tbl. If they have generic ivars set, then we'll use the slow path to clean them up.

All other types use the slow path.

Compatibility

Currently this patch is is disabled for modular GC builds (BUILDING_MODULAR_GC) because we need access to the id2ref_tbl, and I haven't pushed it through the GC API.

Evaluation

Benchmark Results

All benchmarks were run on an Amazon c5.metal bare-metal VM with turbo boost and ASLR disabled, CPU frequency locked and using the performance governor, and pinned to a single core using taskset.

GC-specific benchmarks show significant sweep time improvements:

Benchmark      │    Wall(B)   Sweep(B)  Mark(B) │    Wall(E)   Sweep(E)  Mark(E) │   Wall Δ  Sweep Δ
───────────────┼────────────────────────────────┼────────────────────────────────┼──────────────────
null           │     0.000s        1ms      4ms │     0.000s        1ms      4ms │    0%     0%
hash1          │     4.330s      875ms     46ms │     3.960s      531ms     44ms │ +8.6% +39.3%
hash2          │     6.356s      243ms    988ms │     6.298s      176ms    1.03s │ +0.9% +27.6%
rdoc           │    37.337s      2.42s    1.09s │    36.678s      2.11s    1.20s │ +1.8% +13.1%
binary_trees   │     3.366s      426ms    252ms │     3.082s      275ms    239ms │ +8.4% +35.4%
ring           │     5.252s       14ms    2.47s │     5.327s       12ms    2.43s │ -1.4% +14.3%
redblack       │     2.966s       28ms     41ms │     2.940s       21ms     38ms │ +0.9% +25.0%
───────────────┼────────────────────────────────┼────────────────────────────────┼──────────────────

Legend: (B) = Baseline, (E) = Experiment, Δ = improvement (positive = faster)
        Wall = total wallclock, Sweep = GC sweeping time, Mark = GC marking time
        Times are median of 3 runs

Application benchmarks (YJIT enabled):

--------------  -----------  ----------  ---------------  ----------  ------------------  -----------------
bench           master (ms)  stddev (%)  experiment (ms)  stddev (%)  experiment 1st itr  master/experiment
activerecord    132.5        0.9         132.5            1.0         1.056               1.001
chunky-png      577.2        0.4         580.1            0.4         0.994               0.995
erubi-rails     902.9        0.2         894.3            0.2         1.040               1.010
hexapdf         1763.9       3.3         1760.6           3.7         1.027               1.002
liquid-c        56.9         0.6         56.7             1.4         1.004               1.003
liquid-compile  46.3         2.1         46.1             2.1         1.005               1.004
liquid-render   77.8         0.8         75.1             0.9         1.023               1.036
mail            114.7        0.4         113.0            1.4         1.054               1.015
psych-load      1635.4       1.4         1625.9           0.5         0.988               1.006
railsbench      1685.4       2.4         1650.1           2.0         0.989               1.021
rubocop         133.5        8.1         130.3            7.8         1.002               1.024
ruby-lsp        140.3        1.9         137.5            1.8         1.007               1.020
sequel          64.6         0.7         63.9             0.7         1.003               1.011
shipit          1196.2       4.3         1181.5           4.2         1.003               1.012
--------------  -----------  ----------  ---------------  ----------  ------------------  -----------------

Legend:
- master/experiment: ratio of master/experiment time. Higher is better for experiment. Above 1 represents a speedup.

Analysis

The GC benchmarks show sweep time improvements of 13-39% across workloads, with the largest gains in hash-heavy and tree-based benchmarks where many simple embedded objects are allocated. The hash1 benchmark shows nearly 40% faster sweeping, translating to 8.6% wall clock improvement.

Application benchmarks show modest but consistent 1-2% improvements, with railsbench showing a 2.1% speedup. The improvement is smaller for applications because sweep time is typically a small fraction of total execution time, but the gains are additive with other optimizations.

The ring benchmark shows a small regression in wall time despite faster sweeping, likely due to measurement noise or allocation pattern differences. The sweep improvement (+14.3%) is still present.

The conservative nature of the fast path means some objects that could theoretically skip cleanup will still use the slow path. For example, a T_STRING with generic instance variables added via instance_variable_set will use the slow path even though the string content itself is embedded. This is intentional - We'd rather miss an optimisation opportunity than cause a memory leak.

Updated by byroot (Jean Boussier) about 1 month ago Actions #1 [ruby-core:124613]

@eightbitraptor (Matt V-H) I personally like this patch and think it's a positive perf improvement with no user facing consequence. I don't see any downsides except that it makes developing Ruby a bit more complicated as you could forget to update this logic and cause memory leaks but seems acceptable.

What I wonder though is if you have an idea of how much time is spent in this routine? Because I think instead of checking for these various cases in there, we could also dedicate an object flag for that purpose, and have the various types responsible for keeping that flag updated.

Updated by eightbitraptor (Matt V-H) about 1 month ago · Edited Actions #2 [ruby-core:124615]

My original implementation of this idea had an extra bitmap used to determine whether each object needed a "full sweep" or not. I abandoned this idea partly because having to remember to manually update the flag and keep it in sync when the objects state changed was a lot of work and pretty error prone, and partly because the cache impacts of having to check a bitmap ended up hurting performance more than it helped.

I like the idea of a single flag in the header. I'd be interested to see if the cost of updating the flag negates the benefit of the fast path optimisation. I'll investigate, I didn't think we had any bits spare but it looks like we might.

I don't think this approach will cause memory leaks though because it's an allowlist - everything defaults to the slow path unless certain conditions are set. The worst thing is that objects that could be fast-tracked will be sent down the slow path.

Updated by eightbitraptor (Matt V-H) about 1 month ago Actions #3

  • Status changed from Open to Closed

Applied in changeset git|c21f3490d1f28b43564639ae8563bc2e02e828a4.


Implement a fast path for sweeping (gc_sweep_fast_path_p).

[Feature #21846]

There is a single path through our GC Sweeping code, and we always call
rb_gc_obj_free_vm_weak_references and rb_gc_obj_free before adding the
object back to the freelist.

We do this even when the object has no external resources that require
being free'd and has no weak references pointing to it.

This commit introduces a conservative fast path through gc_sweep_plane
that uses the object flags to identify certain cases where these calls
can be skipped - for these objects we just add them straight back on the
freelist. Any object for which gc_sweep_fast_path_p returns false will
use the current full sweep code (referred to here as the slow path).

Currently there are 2 checks that
will always require an object to go down the slow path:

  1. Has it's object_id been observed and stored in the id2ref_table
  2. Has it got generic ivars in the gen_fields table

If neither of these are true, then we run some flag checks on the object
and send the following cases down the fast path:

  • Objects that are not heap allocated
  • Embedded strings that aren't in the fstring table
  • Embedded Arrays
  • Embedded Hashes
  • Embedded Bignums
  • Embedded Strings
  • Floats, Rationals and Complex
  • Various IMEMO subtypes that do no allocation

We've benchmarked this code using ruby-bench as well as the gcbench
benchmarks inside Ruby (benchmarks/gc) and this patch results in a
modest speed improvement on almost all of the headline benchmarks (2% in
railsbench with YJIT enabled), and an observable 30% improvement in time
spent sweeping during the GC benchmarks:

master: ruby 4.1.0dev (2026-01-19T12:03:33Z master 859920dfd2) +YJIT +PRISM [x86_64-linux]
experiment: ruby 4.1.0dev (2026-01-16T21:36:46Z mvh-sweep-fast-pat.. c3ffe377a1) +YJIT +PRISM [x86_64-linux]

--------------  -----------  ----------  ---------------  ----------  ------------------  -----------------
bench           master (ms)  stddev (%)  experiment (ms)  stddev (%)  experiment 1st itr  master/experiment
lobsters        N/A          N/A         N/A              N/A         N/A                 N/A
activerecord    132.5        0.9         132.5            1.0         1.056               1.001
chunky-png      577.2        0.4         580.1            0.4         0.994               0.995
erubi-rails     902.9        0.2         894.3            0.2         1.040               1.010
hexapdf         1763.9       3.3         1760.6           3.7         1.027               1.002
liquid-c        56.9         0.6         56.7             1.4         1.004               1.003
liquid-compile  46.3         2.1         46.1             2.1         1.005               1.004
liquid-render   77.8         0.8         75.1             0.9         1.023               1.036
mail            114.7        0.4         113.0            1.4         1.054               1.015
psych-load      1635.4       1.4         1625.9           0.5         0.988               1.006
railsbench      1685.4       2.4         1650.1           2.0         0.989               1.021
rubocop         133.5        8.1         130.3            7.8         1.002               1.024
ruby-lsp        140.3        1.9         137.5            1.8         1.007               1.020
sequel          64.6         0.7         63.9             0.7         1.003               1.011
shipit          1196.2       4.3         1181.5           4.2         1.003               1.012
--------------  -----------  ----------  ---------------  ----------  ------------------  -----------------

Legend:
- experiment 1st itr: ratio of master/experiment time for the first benchmarking iteration.
- master/experiment: ratio of master/experiment time. Higher is better for experiment. Above 1 represents a speedup.
Benchmark      │    Wall(B)   Sweep(B)  Mark(B) │    Wall(E)   Sweep(E)  Mark(E) │   Wall Δ  Sweep Δ
───────────────┼─────────────────────────────────┼─────────────────────────────────┼──────────────────
null           │     0.000s        1ms      4ms │     0.000s        1ms      4ms │       0%       0%
hash1          │     4.330s      875ms     46ms │     3.960s      531ms     44ms │ +8.6% +39.3%
hash2          │     6.356s      243ms    988ms │     6.298s      176ms    1.03s │ +0.9% +27.6%
rdoc           │    37.337s      2.42s    1.09s │    36.678s      2.11s    1.20s │ +1.8% +13.1%
binary_trees   │     3.366s      426ms    252ms │     3.082s      275ms    239ms │ +8.4% +35.4%
ring           │     5.252s       14ms    2.47s │     5.327s       12ms    2.43s │ -1.4% +14.3%
redblack       │     2.966s       28ms     41ms │     2.940s       21ms     38ms │ +0.9% +25.0%
───────────────┼─────────────────────────────────┼─────────────────────────────────┼──────────────────

Legend: (B) = Baseline, (E) = Experiment, Δ = improvement (positive = faster)
        Wall = total wallclock, Sweep = GC sweeping time, Mark = GC marking time
        Times are median of 3 runs

These results are also borne out when YJIT is disabled:

master: ruby 4.1.0dev (2026-01-19T12:03:33Z master 859920dfd2) +PRISM [x86_64-linux]
experiment: ruby 4.1.0dev (2026-01-16T21:36:46Z mvh-sweep-fast-pat.. c3ffe377a1) +PRISM [x86_64-linux]

--------------  -----------  ----------  ---------------  ----------  ------------------  -----------------
bench           master (ms)  stddev (%)  experiment (ms)  stddev (%)  experiment 1st itr  master/experiment
lobsters        N/A          N/A         N/A              N/A         N/A                 N/A
activerecord    389.6        0.3         377.5            0.3         1.032               1.032
chunky-png      1123.4       0.2         1109.2           0.2         1.013               1.013
erubi-rails     1754.3       0.1         1725.7           0.1         1.035               1.017
hexapdf         3346.5       0.9         3326.9           0.7         1.003               1.006
liquid-c        84.0         0.5         83.5             0.5         0.992               1.006
liquid-compile  74.0         1.5         73.5             1.4         1.011               1.008
liquid-render   199.9        0.4         199.6            0.4         1.000               1.002
mail            177.8        0.4         176.4            0.4         1.069               1.008
psych-load      2749.6       0.7         2777.0           0.0         0.980               0.990
railsbench      2983.0       1.0         2965.5           0.8         1.041               1.006
rubocop         228.8        1.0         227.5            1.2         1.015               1.005
ruby-lsp        221.8        0.9         216.1            0.8         1.011               1.026
sequel          89.1         0.5         89.1             1.8         1.005               1.000
shipit          2385.6       1.6         2371.8           1.0         1.002               1.006
--------------  -----------  ----------  ---------------  ----------  ------------------  -----------------

Legend:
- experiment 1st itr: ratio of master/experiment time for the first benchmarking iteration.
- master/experiment: ratio of master/experiment time. Higher is better for experiment. Above 1 represents a speedup.
Benchmark      │    Wall(B)   Sweep(B)  Mark(B) │    Wall(E)   Sweep(E)  Mark(E) │   Wall Δ  Sweep Δ
───────────────┼─────────────────────────────────┼─────────────────────────────────┼──────────────────
null           │     0.000s        1ms      4ms │     0.000s        1ms      3ms │       0%       0%
hash1          │     4.349s      877ms     45ms │     4.045s      532ms     44ms │ +7.0% +39.3%
hash2          │     6.575s      235ms    967ms │     6.540s      181ms    1.04s │ +0.5% +23.0%
rdoc           │    45.782s      2.23s    1.14s │    44.925s      1.90s    1.01s │ +1.9% +15.0%
binary_trees   │     6.433s      426ms    252ms │     6.268s      278ms    240ms │ +2.6% +34.7%
ring           │     6.584s       17ms    2.33s │     6.738s       13ms    2.33s │ -2.3% +30.8%
redblack       │    13.334s       31ms     42ms │    13.296s       24ms    107ms │ +0.3% +22.6%
───────────────┼─────────────────────────────────┼─────────────────────────────────┼──────────────────

Legend: (B) = Baseline, (E) = Experiment, Δ = improvement (positive = faster)
        Wall = total wallclock, Sweep = GC sweeping time, Mark = GC marking time
        Times are median of 3 runs
Actions

Also available in: PDF Atom