Bug #21856
Open
Massive performance degradation of `rb_obj_free` for `T_CLASS` since Ruby 4.0
Description
Loofah sanitization is noticeably slower
Ruby: 3.4.8
Loofah: 2.25.0
Nokogiri: 1.19.0
Iterations: 100000
                                       user     system      total         real
Loofah.fragment + scrub!(:prune)  26.091872   0.000000  26.091872  ( 25.110925)
Loofah.scrub_fragment(:prune)     25.913185   0.010392  25.923577  ( 24.948464)
Nokogiri HTML parse only           3.852690   0.000000   3.852690  (  3.705930)
Ruby: 4.0.0 & 4.0.1
Loofah: 2.25.0
Nokogiri: 1.19.0
Iterations: 100000
                                       user     system      total         real
Loofah.fragment + scrub!(:prune)  38.094207   0.041753  38.135960  ( 36.669463)
Loofah.scrub_fragment(:prune)     40.168795   0.000045  40.168840  ( 38.561806)
Nokogiri HTML parse only           4.012936   0.052024   4.064960  (  3.913272)
Ruby: 4.1.0 (ruby 4.1.0dev (2026-01-31T09:41:30Z master 7ef8c470d2) +PRISM [x86_64-linux])
Loofah: 2.25.0
Nokogiri: 1.19.0
Iterations: 100000
                                       user     system      total         real
Loofah.fragment + scrub!(:prune)  39.004228   0.000000  39.004228  ( 37.694873)
Loofah.scrub_fragment(:prune)     39.043199   0.031284  39.074483  ( 37.182785)
Nokogiri HTML parse only           3.889100   0.010427   3.899527  (  3.741622)
Originally reported at https://www.redmine.org/issues/43737
Updated by byroot (Jean Boussier) 18 days ago
I'm able to repro on my machine, even though the difference isn't quite as bad (more like 30% slower).
Profile of ruby 3.4.7: https://share.firefox.dev/4rw3mv0
Profile of ruby 4.0.0: https://share.firefox.dev/4rtvrmt
The striking difference in the profile seems to be that 4.0 spends 28% of its time in remove_class_from_subclasses -> rb_classext_free_subclasses -> rb_iclass_classext_free -> rb_classext_foreach -> rb_obj_free.
A few notes:
- This codepath was changed a lot with the Ruby::Box introduction; it may have become significantly slower.
- It's surprising that we're sweeping lots of Class objects; perhaps Loofah or Nokogiri are inadvertently allocating singleton classes in a hot spot?
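One way to check the second point is to count live T_CLASS slots around a suspected hot spot. This is only a minimal sketch: `class_allocations` is a made-up helper name, the two workloads are placeholders rather than the actual Loofah or Nokogiri calls, and GC is disabled during the measurement so that swept classes don't hide allocations.

```ruby
# Hypothetical helper: returns how many T_CLASS slots were added while the block ran.
def class_allocations
  was_disabled = GC.disable # true if GC was already disabled before this call
  before = ObjectSpace.count_objects[:T_CLASS]
  yield
  ObjectSpace.count_objects[:T_CLASS] - before
ensure
  GC.enable unless was_disabled
end

# Placeholder workloads; swap in the code under suspicion.
plain      = class_allocations { 1_000.times { Object.new } }
singletons = class_allocations { 1_000.times { Object.new.singleton_class } }

puts "plain objects:     #{plain} new classes"
puts "singleton classes: #{singletons} new classes"
```

If a block that is not supposed to create classes reports a count that grows with the number of iterations, something inside it is likely creating singleton classes.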
Updated by byroot (Jean Boussier) 18 days ago
I reduced the benchmark to:
# frozen_string_literal: true
require "bundler/inline"
gemfile do
  source 'https://rubygems.org'
  gem "benchmark-ips"
end
Benchmark.ips do |x|
  x.report("singleton") do
    Object.new.singleton_class
  end
end
3.4.7:
ruby 3.4.7 (2025-10-08 revision 7a5688e2a2) +PRISM [arm64-darwin25]
Warming up --------------------------------------
singleton 742.338k i/100ms
Calculating -------------------------------------
singleton 7.381M (± 2.2%) i/s (135.48 ns/i) - 37.117M in 5.031106s
4.0.0:
ruby 4.0.0 (2025-12-25 revision 553f1675f3) +PRISM [arm64-darwin25]
Warming up --------------------------------------
singleton 13.919k i/100ms
Calculating -------------------------------------
singleton 146.202k (±28.4%) i/s (6.84 μs/i) - 668.112k in 5.059563s
So that's a pretty massive regression in class sweeping. I'll see what I can do.
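To separate allocation cost from sweep cost, a rough sketch like the following can help. It assumes nothing beyond core Ruby; the `elapsed` helper and the batch size are made up for illustration, and the `GC.start` timing includes marking as well as sweeping, so it is only an approximation of the time spent freeing classes.

```ruby
# Monotonic wall-clock time taken by the block, in seconds.
def elapsed
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

batch = 50_000 # arbitrary batch size, not taken from the report

GC.start   # start from a clean heap
GC.disable # keep the sweep out of the allocation loop
alloc_time = elapsed { batch.times { Object.new.singleton_class } }
GC.enable

# A full GC now has to free the whole batch (this also includes marking time).
gc_time = elapsed { GC.start }

puts format("ruby %s: alloc %.1f ms, GC.start %.1f ms",
            RUBY_VERSION, alloc_time * 1000, gc_time * 1000)
```

Running the same script under 3.4 and 4.0 and comparing the two numbers should show whether the slowdown sits in allocation or in the collection that follows.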
Updated by byroot (Jean Boussier) 18 days ago
So the regression is indeed a consequence of the Box introduction.
When sweeping a Class, we need to remove the backreference from rb_classext_struct.box_super_subclasses and rb_classext_struct.box_module_subclasses, and each of these involves multiple st_table lookups and updates, which is way more work than we used to have to do.
There might be a way to optimize this, but my understanding of how boxes are supposed to work is limited, so I don't know if I can fix it without breaking boxes.
Here again the solution might be to have a fast path for the overwhelming majority of classes that aren't impacted by boxes, but it would make the code way more complex.
Updated by byroot (Jean Boussier) 18 days ago
- Backport changed from 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN, 4.0: UNKNOWN to 3.2: DONTNEED, 3.3: DONTNEED, 3.4: DONTNEED, 4.0: REQUIRED
Updated by byroot (Jean Boussier) 18 days ago
- Subject changed from Nokogiri performance degradation since Ruby 4.0 to Massive performance degradation of `rb_obj_free` for `T_CLASS` since Ruby 4.0
I spent some time trying to fix this. I think it's possible, but it is a pretty major refactoring.
In 3.4:
Classes have a subclasses doubly-linked list, which is necessary to be able to iterate subclasses efficiently.
To be able to purge these lists efficiently, each class also keeps a direct reference to the node that contains it in the parent's linked list (subclass_entry).
They also have another linked list with all the modules they have been included in.
All this makes it possible to efficiently remove all the references to a given class.
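As a rough illustration of why the back-reference matters, here is a toy Ruby model of the scheme described above. It is a conceptual sketch only: `Entry` and `ModelClass` are invented names, and the real implementation is C code inside CRuby with different structures.

```ruby
# Toy model only; not CRuby's actual C structs.
Entry = Struct.new(:klass, :prev_entry, :next_entry)

class ModelClass
  attr_reader :name
  attr_accessor :subclass_entry # the node holding this class in its parent's list

  def initialize(name, superclass = nil)
    @name = name
    @subclasses_head = Entry.new(nil, nil, nil) # sentinel head of the subclass list
    superclass&.add_subclass(self)
  end

  # Push a new node at the front of the list and remember it on the subclass.
  def add_subclass(klass)
    first = @subclasses_head.next_entry
    entry = Entry.new(klass, @subclasses_head, first)
    first.prev_entry = entry if first
    @subclasses_head.next_entry = entry
    klass.subclass_entry = entry
  end

  # O(1) removal: the class already knows its own node, so no search is needed.
  def detach_from_superclass
    entry = @subclass_entry
    return unless entry
    entry.prev_entry.next_entry = entry.next_entry
    entry.next_entry.prev_entry = entry.prev_entry if entry.next_entry
    @subclass_entry = nil
  end

  def subclass_names
    names = []
    entry = @subclasses_head.next_entry
    while entry
      names << entry.klass.name
      entry = entry.next_entry
    end
    names
  end
end

base = ModelClass.new("Base")
a = ModelClass.new("A", base)
b = ModelClass.new("B", base)
c = ModelClass.new("C", base)

b.detach_from_superclass
p base.subclass_names # => ["C", "A"]
```

Because each class holds its own node, removing it is two pointer updates regardless of how many siblings it has.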
In 4.0:
It's roughly the same, except the 3 references above are all behind an extra st_table indirection. So before you can access any of these lists, you need to do an extra hash lookup.
To be very honest, I don't understand why it is necessary, given these lists are inside rb_classext_t and, from my understanding, classes have one rb_classext_t per box, so that indirection seems redundant to me.
But then again, I don't understand the box design well, so I may be overlooking something, and I don't know if that's something I can reasonably fix.