Bug #20301

Updated by AMomchilov (Alexander Momchilov) 9 months ago

A common usage of `Set`s is to keep track of seen objects, and do something different whenever an object is seen for the first time, e.g.: 

```ruby
require "set"

SEEN_VALUES = Set.new

def receive_value(value)
  if SEEN_VALUES.add?(value)
    puts "Saw #{value} for the first time."
  else
    puts "Already seen #{value}, ignoring."
  end
end

receive_value(1) # Saw 1 for the first time.
receive_value(2) # Saw 2 for the first time.
receive_value(3) # Saw 3 for the first time.
receive_value(1) # Already seen 1, ignoring.
```

 Readers might reasonably assume that `add?` is only looking up into the set a single time, but it's actually doing two separate look-ups! ([source](https://github.com/ruby/ruby/blob/c976cb5/lib/set.rb#L517-L525)) 

```rb
class Set
  def add?(o)
    # 1. `include?(o)` looks up into `@hash`
    # 2. if the value isn't there, `add(o)` does a second look-up into `@hash`
    add(o) unless include?(o)
  end
end
```
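The double lookup can be observed directly by counting `#hash` invocations on a key object. This is a small instrumentation sketch of mine, not from the ticket; the exact counts depend on the `Set` implementation, so the comments describe the two-lookup stdlib version quoted above:

```ruby
require "set"

# A key that counts how often its `#hash` is computed.
class CountingKey
  attr_reader :id, :hash_calls

  def initialize(id)
    @id = id
    @hash_calls = 0
  end

  def hash
    @hash_calls += 1
    @id.hash
  end

  def eql?(other)
    other.is_a?(CountingKey) && other.id == @id
  end
end

set = Set.new
key = CountingKey.new(1)

set.add?(key)       # new element
puts key.hash_calls # 2 under the two-lookup implementation quoted above

set.add?(key)       # already present
puts key.hash_calls # one more call: only the `include?` lookup runs
```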

This gets especially expensive if the keys are large hashes, arrays, or other objects whose `#hash` is expensive to compute.

We could optimize this if it were possible to set a value in a hash *and* retrieve the value that was previously there, in a single operation. I propose adding `Hash#exchange_value` (or `Hash#update_value`) to do exactly that. With it, `#add?` could be re-implemented as:

```rb
class Set
  def add?(o)
    # Only requires a single look-up into `@hash`!
    self unless @hash.exchange_value(o, true)
  end
end
```
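Since `exchange_value` doesn't exist yet, its intended interface can be sketched as a pure-Ruby polyfill. This is illustrative only: it still performs two lookups internally, and only a native implementation can collapse them into one.

```ruby
class Hash
  # Polyfill sketching the proposed semantics: store `value` under `key`
  # and return whatever was stored there before (nil if the key was absent).
  # This Ruby version models the interface, not the single-lookup
  # optimization.
  def exchange_value(key, value)
    previous = self[key]
    self[key] = value
    previous
  end
end

h = {}
h.exchange_value(:a, 1) # => nil (key was absent)
h.exchange_value(:a, 2) # => 1   (the previous value)
```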

 Here's a proof-of-concept implementation: https://github.com/ruby/ruby/pull/10093 

 # Theory 

How much of a benefit this has depends on two factors:

1. How often `#hash` is called, which depends on how many **new** objects are added to the set.
    * If every object is new, then `#hash` used to be called twice on every `#add?`.
        * This is where this improvement makes the biggest (2x!) change.
    * If every object has already been seen, then `#hash` was never called twice anyway, so there is no improvement.
        * It's important not to regress in this case, because many uses of sets don't add many distinct objects, but just need quick membership checks against an existing set.
    * Every other case lies somewhere in between, depending on the percentage of objects that are new.
2. How slow `#hash` is to compute for the key.
    * If the hash is slow to compute, this change makes a bigger improvement.
    * If the hash value is fast to compute, it matters less: even halving the number of calls has little net impact, since hashing is a minority of the total time.
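Factor 1 can be made concrete with a toy cost model (my own back-of-the-envelope sketch, not from the PR): if `p` is the fraction of `#add?` arguments that are new, the two-lookup implementation computes `#hash` an expected `1 + p` times per call, versus exactly once for a single-lookup implementation.

```ruby
# Toy model: expected `#hash` calls per `Set#add?` call, where `p` is the
# fraction of arguments never seen before.
def expected_hash_calls(p, single_lookup:)
  if single_lookup
    1.0                 # one combined lookup, hit or miss
  else
    (1 - p) * 1 + p * 2 # hit: `include?` only; miss: `include?` then `add`
  end
end

expected_hash_calls(1.0, single_lookup: false) # => 2.0 (all new: twice the calls)
expected_hash_calls(0.0, single_lookup: false) # => 1.0 (all seen: no savings)
```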

 # Benchmark summary 

 |                             | All objects are new | All objects are preexisting | 
 |---------------------------|-------:|------:| 
 | objects with slow `#hash` | 100.0% | ~0.0% | 
 | objects with fast `#hash` |    24.5% |    4.6% | 

As we see, this change makes a huge improvement in the cases where it helps, and crucially, doesn't slow down the cases where it can't.

 For the complete benchmark source code and results, see the PR: https://github.com/ruby/ruby/pull/10093
