Bug #20301 (open)
`Set#add?` does two hash look-ups
Description
A common usage of `Set`s is to keep track of seen objects, and do something different whenever an object is seen for the first time, e.g.:
```ruby
SEEN_VALUES = Set.new

def receive_value(value)
  if SEEN_VALUES.add?(value)
    puts "Saw #{value} for the first time."
  else
    puts "Already seen #{value}, ignoring."
  end
end

receive_value(1) # Saw 1 for the first time.
receive_value(2) # Saw 2 for the first time.
receive_value(3) # Saw 3 for the first time.
receive_value(1) # Already seen 1, ignoring.
```
Readers might reasonably assume that `add?` only looks up into the set a single time, but it's actually doing two separate look-ups! (source)
```ruby
class Set
  def add?(o)
    # 1. `include?(o)` looks up into `@hash`
    # 2. if the value isn't there, `add(o)` does a second look-up into `@hash`
    add(o) unless include?(o)
  end
end
```
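The double look-up can be observed by counting `#hash` invocations with a wrapper key (the `CountingKey` class below is illustrative, not from the report). On CRuby versions where `Set#add?` is implemented as shown above (e.g. 3.3), adding a brand-new element calls `#hash` twice, while re-adding an already-seen element calls it only once:

```ruby
require "set"

# Illustrative wrapper that counts how often Ruby asks for its hash value.
class CountingKey
  attr_reader :value, :hash_calls

  def initialize(value)
    @value = value
    @hash_calls = 0
  end

  def hash
    @hash_calls += 1
    @value.hash
  end

  def eql?(other)
    other.is_a?(CountingKey) && value.eql?(other.value)
  end
end

set = Set.new
key = CountingKey.new(:a)

set.add?(key) # first time: the element is new
puts "hash calls after first add?: #{key.hash_calls}"

before = key.hash_calls
set.add?(key) # second time: the element is already present
puts "hash calls for second add?: #{key.hash_calls - before}"
```

The exact counts depend on the Ruby implementation and version, but the return-value contract of `add?` (the set itself for a new element, `nil` otherwise) is stable.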
This gets especially expensive if the values are large hashes, arrays, or objects whose `#hash` is expensive to compute.
We could optimize this if it were possible to set a value in a hash and retrieve the value that was already there, in a single operation. I propose adding `Hash#exchange_value` to do exactly that. If it existed, we could re-implement `#add?` as:
```ruby
class Set
  def add?(o)
    # Only requires a single look-up into `@hash`!
    self unless @hash.exchange_value(o, true)
  end
end
```
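For clarity, here is a pure-Ruby sketch of the semantics the proposed `Hash#exchange_value` would have (the method name comes from this proposal; the emulation below is mine and still performs two look-ups — only the native implementation in the PR can do it in one):

```ruby
class Hash
  # Illustrative emulation of the proposed semantics: store `new_value`
  # under `key` and return whatever value (or nil) was stored before.
  # NOTE: this pure-Ruby version still does two look-ups; the point of
  # the proposal is a native implementation that does only one.
  def exchange_value_emulated(key, new_value)
    old = self[key]
    self[key] = new_value
    old
  end
end

h = {}
p h.exchange_value_emulated(:a, true) # nil (nothing was there before)
p h.exchange_value_emulated(:a, true) # true (the previous value)
```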
Here's a proof-of-concept implementation: https://github.com/ruby/ruby/pull/10093
Theory
How much of a benefit this has depends on 2 factors:

- How much `#hash` is called, which depends on how many new objects are added to the set.
  - If every object is new, then `#hash` used to be called twice on every `#add?`. This is where this improvement makes the biggest (2x!) change.
  - If every object has already been seen, then `#hash` was never being called twice before anyway, so there would be no improvement. It's important not to regress in this case, because many use cases of sets don't deal with many distinct objects, but just need to do quick checks against an existing set.
  - Every other case lies somewhere in between those two, depending on the % of objects which are new.
- How slow `#hash` is to compute for the key.
  - If the hash is slow to compute, this change will make a bigger improvement.
  - If the hash value is fast to compute, then it won't matter as much. Even if we called it half as often, it's a minority of the total time, so it won't have much net impact.
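The first factor can be observed with a small script (illustrative only; the real benchmark lives in the PR). It uses the stdlib `Benchmark` module to time `add?` on keys with a deliberately slow `#hash`, once when every object is new and once when every object already exists in the set:

```ruby
require "set"
require "benchmark"

# Illustrative key with a deliberately expensive #hash (not the PR's
# benchmark; see https://github.com/ruby/ruby/pull/10093 for that).
class SlowHashKey
  attr_reader :id

  def initialize(id)
    @id = id
  end

  def hash
    # Busy-work to make hashing measurably slow (deterministic per id).
    1000.times.reduce(@id.hash) { |acc, i| (acc ^ i).hash }
  end

  def eql?(other)
    other.is_a?(SlowHashKey) && id == other.id
  end
end

keys = Array.new(1_000) { |i| SlowHashKey.new(i) }

fresh = Benchmark.realtime do
  set = Set.new
  keys.each { |k| set.add?(k) } # every object is new: worst case
end

set = Set.new
keys.each { |k| set.add(k) }
seen = Benchmark.realtime do
  keys.each { |k| set.add?(k) } # every object preexists: best case
end

puts format("all new: %.4fs, all seen: %.4fs", fresh, seen)
```

Absolute numbers will vary by machine and Ruby version; the point is only to make the two scenarios from the theory above measurable.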
Benchmark summary
| | All objects are new | All objects are preexisting |
|---|---|---|
| objects with slow `#hash` | 100.0% | ~0.0% |
| objects with fast `#hash` | 24.5% | 4.6% |
As we can see, this change makes a huge improvement in the cases where it helps and, crucially, doesn't slow down the cases where it can't.
For the complete benchmark source code and results, see the PR: https://github.com/ruby/ruby/pull/10093