Bug #21856
closedMassive performance degradation of `rb_obj_free` for `T_CLASS` since Ruby 4.0
Description
Loofah sanitization is noticeably slower
Ruby: 3.4.8
Loofah: 2.25.0
Nokogiri: 1.19.0
Iterations: 100000
user system total real
Loofah.fragment + scrub!(:prune) 26.091872 0.000000 26.091872 ( 25.110925)
Loofah.scrub_fragment(:prune) 25.913185 0.010392 25.923577 ( 24.948464)
Nokogiri HTML parse only 3.852690 0.000000 3.852690 ( 3.705930)
Ruby: 4.0.0 & 4.0.1
Loofah: 2.25.0
Nokogiri: 1.19.0
Iterations: 100000
user system total real
Loofah.fragment + scrub!(:prune) 38.094207 0.041753 38.135960 ( 36.669463)
Loofah.scrub_fragment(:prune) 40.168795 0.000045 40.168840 ( 38.561806)
Nokogiri HTML parse only 4.012936 0.052024 4.064960 ( 3.913272)
Ruby: 4.1.0 (ruby 4.1.0dev (2026-01-31T09:41:30Z master 7ef8c470d2) +PRISM [x86_64-linux])
Loofah: 2.25.0
Nokogiri: 1.19.0
Iterations: 100000
user system total real
Loofah.fragment + scrub!(:prune) 39.004228 0.000000 39.004228 ( 37.694873)
Loofah.scrub_fragment(:prune) 39.043199 0.031284 39.074483 ( 37.182785)
Nokogiri HTML parse only 3.889100 0.010427 3.899527 ( 3.741622)
Originally reported https://www.redmine.org/issues/43737
Files
Updated by byroot (Jean Boussier) about 2 months ago
I'm able to repro on my machine, even though the different isn't quite as bad (more like 30% slower).
Profile of ruby 3.4.7: https://share.firefox.dev/4rw3mv0
Profile of ruby 4.0.0: https://share.firefox.dev/4rtvrmt
The striking difference on the profile seem to be that 4.0 spends 28% of its time in remove_class_from_subclasses -> rb_classext_free_subclasses -> rb_iclass_classext_free -> rb_classext_foreach -> rb_obj_free.
A few notes:
- This codepath was changed a lot with the
Ruby::Boxintroduction, it may have become significantly slower. - It's surprising that we're sweeping lots of Class object, perhaps
LoofahorNokogiriare inadvertently allocating singleton classes in a hot spot?
Updated by byroot (Jean Boussier) about 2 months ago
I reduced the benchmark to:
# frozen_string_literal: true
require "bundler/inline"
gemfile do
source 'https://rubygems.org'
gem "benchmark-ips"
end
Benchmark.ips do |x|
x.report("singleton") do
Object.new.singleton_class
end
end
3.4.7:
ruby 3.4.7 (2025-10-08 revision 7a5688e2a2) +PRISM [arm64-darwin25]
Warming up --------------------------------------
singleton 742.338k i/100ms
Calculating -------------------------------------
singleton 7.381M (± 2.2%) i/s (135.48 ns/i) - 37.117M in 5.031106s
4.0.0
ruby 4.0.0 (2025-12-25 revision 553f1675f3) +PRISM [arm64-darwin25]
Warming up --------------------------------------
singleton 13.919k i/100ms
Calculating -------------------------------------
singleton 146.202k (±28.4%) i/s (6.84 μs/i) - 668.112k in 5.059563s
So that's a pretty massive regression in class sweeping. I'll see what I can do.
Updated by byroot (Jean Boussier) about 2 months ago
So the regression is indeed a consequence of the Box introduction.
When sweeping a Class, we need to remove the backreference from rb_classext_struct.box_super_subclasses and rb_classext_struct.box_module_subclasses, and for each one in involve multiple st_table lookups and updates, which is way more work that we used to have to do.
There might be a way to optimize this, but my understanding of how boxes are supposed to work is limited, so I don't know if I can fix it without breaking boxes.
Here again the solution might be to have a fast path for the overwhelming majority of classes that aren't impacted by boxes, but it would make the code way more complex.
Updated by byroot (Jean Boussier) about 2 months ago
- Backport changed from 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN, 4.0: UNKNOWN to 3.2: DONTNEED, 3.3: DONTNEED, 3.4: DONTNEED, 4.0: REQUIRED
Updated by byroot (Jean Boussier) about 2 months ago
- Subject changed from Nokogiri performance degradation since Ruby 4.0 to Massive performance degradation of `rb_obj_free` for `T_CLASS` since Ruby 4.0
I spent some time trying to fix this, I think it's possible but is a pretty major refactoring.
In 3.4:
Classes have a subclasses doubly-linked list, which is necessary to be able to iterate subclasses efficiently.
As to be able to purge these list effectively, each class also keep a direct reference to the node than contain themselves in the parent linked list (subclass_entry).
They also have another linked list with all the module its been included on.
All this allows to efficiently remove all the references to a given class.
In 4.0:
It's roughly the same, except the 3 references above are all behind an extra st_table indirection. So before you can access any of these lists, you need to do an extra hash lookup.
To be very honest I don't understand why it is necessary, given these lists are inside rb_classext_t and from my understanding classes have one rb_classext_t per box, so that indirection seem redundant to me.
But then again, I don't understand the box design well, so I may be overlooking something, and I don't know if that's something I can reasonably fix.
Updated by jhawthorn (John Hawthorn) 23 days ago
· Edited
I think I've found a solution to this: we can return to the Ruby 3.4 O(1) removal, remove box/namespacing from it, and actually make it even simpler by skipping the CLASS -> ICLASS (and ICLASS -> ICLASS) relationship and directly associating T_CLASS with it's true T_CLASS "superclass".
Currently we maintain the subclasses list for two separate purposes (we essentially have two different relationships we're putting into the same list):
- On a T_MODULE, we track the T_ICLASSes created to include it into other classes. Used for method invalidation and propagating includes on the module that happen after it's been used
- On a T_CLASS/T_ICLASS, we track the T_CLASS/T_ICLASS which are the immediate children of the class. We use this for method invalidation, some cvar things, and to iterate through subclasses.
Purpose 1 does not have any issues with box, the T_ICLASS always belongs to one specific module and that's immutable. This list can be box-global (always use the prime classext or hoist it out) and only needs to be pruned during free. If we care about behaviour under a particular box (ie. the propagating includes), we should look up the current box being modified on the ICLASS itself.
Purpose 2 is more complicated. It currently tracks the immediate children, the T_CLASS or T_ICLASS whose super points back. Because super is per-box and is mutable (include/prepend insert ICLASSes into the chain) we need to update the list on include/prepend, entries must be per-box, and we can have multiple entries per-box. I propose we simplify this by no longer tracking the immediate subclass, but instead tracking the T_CLASS -> ... -> T_CLASS relationship, ie. the inverse of rb_class_superclass. That relationship is the same across all boxes and immutable after Class creation.
As a special case the ICLASS for refinements are also added to the purpose 2 list (on T_CLASS). As those ICLASS do not chain to an eventual leaf T_CLASS.
When we need to find the classes which have included a module, we can use the module subclasses list to find the ICLASS and then use RCLASS_INCLUDER. If we needed to iterate all T_ICLASS, we could then walk up the CLASS_SUPER chain, but I didn't find anywhere we needed to do that.
Updated by mar21 (Martin Halfar) 16 days ago
byroot (Jean Boussier) wrote in #note-1:
I'm able to repro on my machine, even though the different isn't quite as bad (more like 30% slower).
Profile of ruby 3.4.7: https://share.firefox.dev/4rw3mv0
Profile of ruby 4.0.0: https://share.firefox.dev/4rtvrmtThe striking difference on the profile seem to be that 4.0 spends 28% of its time in
remove_class_from_subclasses->rb_classext_free_subclasses->rb_iclass_classext_free->rb_classext_foreach->rb_obj_free.A few notes:
- This codepath was changed a lot with the
Ruby::Boxintroduction, it may have become significantly slower.- It's surprising that we're sweeping lots of Class object, perhaps
LoofahorNokogiriare inadvertently allocating singleton classes in a hot spot?
We have just encountered this issue ourselves and while it is indeed not quite bad on smaller samples, it quickly scales up with larger (to huge) samples (in our case 2.5MB html document).
- On Ruby 3.4.9 and below it took <1s to sanitize.
- On Ruby 4.0.X immediately after startup and/or GC cleanup ~2s. Second run varies 40s-100s. Third run couple of minutes.
Updated by mdalessio (Mike Dalessio) 16 days ago
👋 I'm the maintainer of Nokogiri and Loofah.
Nokogiri has a "decorator" feature that allows modules to be mixed into node objects dynamically. When a document has decorators registered, Nokogiri calls node.extend(mod) on each Node object when it is created. This creates a singleton class per Node (though not necessarily for every DOM node -- only the ones for which we create a Ruby object).
Loofah uses this feature. It defines a DocumentDecorator module included in all of its document classes (XML, HTML4, HTML5), which registers ScrubBehavior::Node as a decorator for every Nokogiri::XML::Node. Loofah traverses the DOM tree and instantiates a Ruby Node object for every DOM node. This means every node in every Loofah-parsed document gets a singleton class allocated — one per node.
Given the fact that Rails uses Loofah to sanitize content, this performance problem is likely to compound badly for Rails applications when sanitizing large HTML pages.
Updated by byroot (Jean Boussier) 16 days ago
Nokogiri has a "decorator" feature
I figured as much when I investigated the problem. Now that John found a fix, it will probably be back to OK performance in 4.0.3, but I think it would be valuable to find a better API in Nokogiri. In theory it should be possible to provide a subclass of Nokogiri::Node for nokogiri to allocate nodes with, instead of a module to extend the instances with.
That would offer an almost as flexible feature but with much better performance.
Updated by jhawthorn (John Hawthorn) 16 days ago
- Status changed from Open to Closed
I merged my proposed design, which should restore Ruby 3.4's performance https://github.com/ruby/ruby/pull/16363 (forgot to tag this issue in the commit message, sorry)
I'm going to look into backporting this to Ruby 4.0. We usually don't backport performance improvements, but I think we should make an exception.
Updated by byroot (Jean Boussier) 16 days ago
We usually don't backport performance improvements, but I think we should make an exception.
IMO the performance degradation is severe enough that I consider it a bug, and that is why I marked it for backport. So I don't really see this as an exception.
Updated by nonster (Nony Dutton) 15 days ago
byroot (Jean Boussier) wrote in #note-11:
We usually don't backport performance improvements, but I think we should make an exception.
IMO the performance degradation is severe enough that I consider it a bug, and that is why I marked it for backport. So I don't really see this as an exception.
This bit us so we downgraded back to Ruby 3.4. Thanks everyone for the fix!