Bug #20301: `Set#add?` does two hash look-ups

Added by AMomchilov (Alexander Momchilov) 2 months ago. Updated about 1 month ago.

Status: Open
Assignee: -
Target version: -
[ruby-core:116941]

Description

A common usage of Sets is to keep track of seen objects, and do something different whenever an object is seen for the first time, e.g.:

require "set"

SEEN_VALUES = Set.new

def receive_value(value)
  if SEEN_VALUES.add?(value)
    puts "Saw #{value} for the first time."
  else
    puts "Already seen #{value}, ignoring."
  end
end

receive_value(1) # Saw 1 for the first time.
receive_value(2) # Saw 2 for the first time.
receive_value(3) # Saw 3 for the first time.
receive_value(1) # Already seen 1, ignoring.

Readers might reasonably assume that add? performs only a single lookup into the set, but it actually does two separate look-ups! (source)

class Set
  def add?(o)
    # 1. `include?(o)` looks up `o` in `@hash`
    # 2. if the value isn't there, `add(o)` does a second look-up into `@hash`
    add(o) unless include?(o)
  end
end
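
To make the double lookup visible, here is a small illustrative script (the CountingKey class is hypothetical, written only for this demonstration) that counts how often #hash is invoked during a single add? of a new element:

require "set"

# Hypothetical key class that counts calls to #hash.
class CountingKey
  @@hash_calls = 0

  def self.hash_calls
    @@hash_calls
  end

  def hash
    @@hash_calls += 1
    42 # a constant hash value is fine for this demonstration
  end

  def eql?(other)
    other.is_a?(CountingKey)
  end
end

set = Set.new
set.add?(CountingKey.new)
# With the current two-lookup implementation this should print 2:
# once for `include?` and once for `add`.
puts CountingKey.hash_calls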

This gets especially expensive if the values are large hashes, arrays, or objects whose #hash is expensive to compute.

We could optimize this if it were possible to set a value in a hash and retrieve the previously stored value in a single operation. I propose adding Hash#exchange_value to do exactly that. If that existed, we could re-implement #add? as:

class Set
  def add?(o)
    # Only requires a single look-up into `@hash`!
    self unless @hash.exchange_value(o, true)
  end
end

Here's a proof-of-concept implementation: https://github.com/ruby/ruby/pull/10093
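
For readers unfamiliar with the proposal in #20300, here is a pure-Ruby sketch of the semantics Hash#exchange_value is assumed to have; the real proposal is a C implementation that performs only one lookup, which this sketch does not attempt to replicate:

class Hash
  # Sketch of the assumed semantics only: store `value` under `key`
  # and return whatever was stored there before (nil if absent).
  # NOTE: this Ruby version still does two lookups; the proposed
  # C implementation would do a single one.
  def exchange_value(key, value)
    previous = self[key]
    self[key] = value
    previous
  end
end

h = {}
h.exchange_value(:a, true) # => nil   (key was absent)
h.exchange_value(:a, true) # => true  (previous value returned)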

Theory

How much of a benefit this yields depends on two factors (see the benchmark sketch after this list):

  1. How often #hash is called, which depends on how many new objects are added to the set.
    • If every object is new, then #hash used to be called twice on every #add?.
      • This is where the improvement makes the biggest (2x!) difference.
    • If every object has already been seen, then #hash was never being called twice anyway, so there is no improvement.
      • It's important not to regress in this case, because many uses of sets don't add many distinct objects, but just need quick membership checks against an existing set.
    • Every other case lies somewhere in between, depending on the percentage of objects that are new.
  2. How slow #hash is to compute for the key.
    • If the hash is slow to compute, this change makes a bigger improvement.
    • If the hash value is fast to compute, it matters less: even if we call #hash half as often, it accounts for a minority of the total time, so the net impact is small.
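
A minimal sketch of how one might measure factor 2, using an illustrative SlowKey class with a deliberately expensive #hash (the class and labels are hypothetical; the actual benchmarks are in the PR linked below):

require "benchmark"
require "set"

# Illustrative key class with an artificially expensive #hash.
class SlowKey
  attr_reader :n

  def initialize(n)
    @n = n
  end

  def hash
    # Deliberately slow hash computation.
    10_000.times.reduce(@n) { |acc, i| acc ^ (i * 31) }.hash
  end

  def eql?(other)
    other.is_a?(SlowKey) && other.n == n
  end
end

keys = Array.new(1_000) { |i| SlowKey.new(i) }

Benchmark.bm(22) do |x|
  x.report("all keys new") { set = Set.new; keys.each { |k| set.add?(k) } }
  x.report("all keys preexisting") { set = Set.new(keys); keys.each { |k| set.add?(k) } }
end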

Benchmark summary

                        | All objects are new | All objects are preexisting
objects with slow #hash | 100.0%              | ~0.0%
objects with fast #hash | 24.5%               | 4.6%

As we can see, this change makes a huge improvement in the cases where it helps, and, crucially, doesn't slow down the cases where it can't.

For the complete benchmark source code and results, see the PR: https://github.com/ruby/ruby/pull/10093


Related issues (1 open, 0 closed)

Related to Ruby master - Feature #20300: Hash: set value and get pre-existing value in one call (Open)

Updated by AMomchilov (Alexander Momchilov) 2 months ago

  • Description updated (diff)

Updated by AMomchilov (Alexander Momchilov) 2 months ago

  • Description updated (diff)

Updated by Dan0042 (Daniel DeLorme) 2 months ago · Edited

Now I understand why you proposed #20300 Hash#update_value

However I'd like to suggest an alternative approach for your consideration:

def add?(k)
  added = false
  @hash.add(k) { added = true } # call block only if k is not in @hash; return existing or added value
  self if added
end

This is likely to be a bit less efficient than your approach; however, Hash#add is a method I've been missing from Ruby for a long time, and I would find it infinitely more useful than Hash#update_value.

Updated by Eregon (Benoit Daloze) 2 months ago

@Dan0042 (Daniel DeLorme) That's related to #17342 then, and also known as compute_if_absent in concurrent-ruby.
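
For context, compute_if_absent in concurrent-ruby runs its block only when the key is absent, which can express an add?-style check; a rough sketch (using Concurrent::Map, not Hash):

require "concurrent"

# Sketch: an add?-style check built on Concurrent::Map#compute_if_absent.
# The block runs only when `value` is not yet a key, so `added` flips
# to true exactly once per distinct value.
def first_time?(seen, value)
  added = false
  seen.compute_if_absent(value) { added = true }
  added
end

seen = Concurrent::Map.new
first_time?(seen, 1) # => true
first_time?(seen, 1) # => false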


Updated by Eregon (Benoit Daloze) 2 months ago

  • Related to Feature #20300: Hash: set value and get pre-existing value in one call added

Updated by AMomchilov (Alexander Momchilov) 2 months ago

I don't mind it, @Dan0042 (Daniel DeLorme), but that's a secondary issue IMO. The block call defeats the benefit of this optimization: it would even slow down the case of looking up pre-existing objects (which is net-even in performance after these changes), and that's a big no-no.


Updated by AMomchilov (Alexander Momchilov) about 2 months ago

  • Description updated (diff)

Updated by shyouhei (Shyouhei Urabe) about 1 month ago

Why not:

def add?(o)
  n = size
  add(o)
  m = size
  return n == m ? self : nil
end

This implementation involves only one hash lookup.

Updated by nobu (Nobuyoshi Nakada) about 1 month ago

shyouhei (Shyouhei Urabe) wrote in #note-8:

  return n == m ? self : nil

The return value is inverted.

Updated by shyouhei (Shyouhei Urabe) about 1 month ago

nobu (Nobuyoshi Nakada) wrote in #note-9:

shyouhei (Shyouhei Urabe) wrote in #note-8:

  return n == m ? self : nil

The return value is inverted.

My bad. Thank you for the correction.
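
For completeness, a corrected sketch of the size-based approach, with the comparison flipped as nobu points out:

def add?(o)
  n = size
  add(o)
  m = size
  # `add` grows the set only when `o` was not already present,
  # so a size change means the element was newly added.
  n == m ? nil : self
end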

Updated by Eregon (Benoit Daloze) about 1 month ago

That implementation using size is not thread-safe, even on CRuby AFAIK.
For example, suppose T2 calls add? with a new element while T1 calls add? with an existing element:
if T1 is just before m = size when T2 executes add(o), then both threads return "element added", but T1 did not add an element (an incorrect result).

The original code with two lookups does not have that race condition.
However, it has a different one: two threads adding the same new element can both return "element added".
Hash#exchange_value would fix that.

Updated by shyouhei (Shyouhei Urabe) about 1 month ago

Yes. add(o) unless include?(o) isn't thread-safe to begin with; my implementation just doesn't try to improve that.

Updated by AMomchilov (Alexander Momchilov) about 1 month ago

shyouhei (Shyouhei Urabe) wrote in #note-8:

Why not:

Because I didn't think of that :)

I would be okay with it, but I think the thread safety issue is also worth solving. The implementation I'm proposing solves both the performance and safety problems.
