Actions

Copy link

Bug #20301

open

`Set#add?` does two hash look-ups

Added by AMomchilov (Alexander Momchilov) 8 months ago. Updated 8 months ago.

Status:

Open

Assignee:

Target version:

ruby -v:

Backport:

3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN

[ruby-core:116941]

Description

A common usage of Sets is to keep track of seen objects, and do something different whenever an object is seen for the first time, e.g.:

SEEN_VALUES = Set.new
	
def receive_value(value)
	if SEEN_VALUES.add?(value)
		puts "Saw #{value} for the first time."
	else
		puts "Already seen #{value}, ignoring."
	end
end

receive_value(1) # Saw 1 for the first time.
receive_value(2) # Saw 2 for the first time.
receive_value(3) # Saw 3 for the first time.
receive_value(1) # Already seen 1, ignoring.

Readers might reasonably assume that add? is only looking up into the set a single time, but it's actually doing two separate look-ups! (source)

class Set
  def add?(o
    # 1. `include?(o)` looks up into `@hash`
    # 2. if the value isn't there, `add(o)` does a second look-up into `@hash`
    add(o) unless include?(o)
  end
end

This gets especially expensive if the values are large hash/arrays/objects, whose #hash is expensive to compute.

We can optimize this if it was possible to set a value in hash, and retrieve the value that was already there, in a single go. I propose adding Hash#exchange_value to do exactly that. If that existed, we can re-implement #add? as:

class Set
  def add?(o)
    # Only requires a single look-up into `@hash`!
    self unless @hash.exchange_value(o, true)
  end

Here's a proof-of-concept implementation: https://github.com/ruby/ruby/pull/10093

Theory¶

How much of a benefit this has depends on 2 factors:

How much #hash is called, which depends on how many new objects are added to the set.
- If every object is new, then #hash used to be called twice on every #add?.
  - This is where this improvement makes the biggest (2x!) change.
- If every object has already been seen, then #hash was never being called twice before anyway, so there would be no improvement.
  - It's important to not regress in this case, because many use cases of sets don't deal with many distinct objects, but just need to do quick checks against an existing set.
- Every other case lies somewhere in between those two, depending on the % of objects which are new.
How slow #hash is to compute for the key
- If the hash is slow to compute, this change will make a bigger improvement
- If the hash value is fast to compute, then it won't matter as much. Even if we called it half as much, it's a minority of the total time, so it won't have much net impact.

Benchmark summary¶

	All objects are new	All objects are preexisting
objects with slow `#hash`	100.0%	~0.0%
objects with fast `#hash`	24.5%	4.6%

As we see, this change makes a huge improvement the cases where it helps, and crucially, doesn't slow down the cases where it can't.

For the complete benchmark source code and results, see the PR: https://github.com/ruby/ruby/pull/10093

Related issues 1 (1 open — 0 closed)

Actions

Copy link

Updated by AMomchilov (Alexander Momchilov) 8 months ago

Description updated (diff)

Actions

Copy link

Updated by AMomchilov (Alexander Momchilov) 8 months ago

Description updated (diff)

Actions

Copy link

#3 [ruby-core:116952]

Updated by Dan0042 (Daniel DeLorme) 8 months ago · Edited

Now I understand why you proposed #20300 Hash#update_value

However I'd like to suggest an alternative approach for your consideration:

def add?(k)
  added = false
  @hash.add(k){ added = true } #call block only if k not in @hash; return existing or added value
  self if added
end

This is likely to be a bit less efficient than your approach, however Hash#add is a method I've been missing from ruby for a long time, and would find infinitely more useful than Hash#update_value

Actions

Copy link

#4 [ruby-core:116955]

Updated by Eregon (Benoit Daloze) 8 months ago

@Dan0042 (Daniel DeLorme) That's related to #17342 then, and also known as compute_if_absent in concurrent-ruby.

Actions

Copy link

Updated by Eregon (Benoit Daloze) 8 months ago

Related to Feature #20300: Hash: set value and get pre-existing value in one call added

Actions

Copy link

#6 [ruby-core:116959]

Updated by AMomchilov (Alexander Momchilov) 8 months ago

I don't mind it @Dan0042 (Daniel DeLorme), but that's a secondary issue IMO. The block call defeats the benefit of this optimization. It'll even slow down the case where you're looking up pre-existing objects (that's currently net-even perf after these changes), and that's a big no-no.

Actions

Copy link

Updated by AMomchilov (Alexander Momchilov) 8 months ago

Description updated (diff)

Actions

Copy link

#8 [ruby-core:117138]

Updated by shyouhei (Shyouhei Urabe) 8 months ago

Why not:

def add?(o)
  n = size
  add(o)
  m = size
  return n == m ? self : nil
end

This implementation involves only one hash lookup.

Actions

Copy link

#9 [ruby-core:117141]

Updated by nobu (Nobuyoshi Nakada) 8 months ago

shyouhei (Shyouhei Urabe) wrote in #note-8:

  return n == m ? self : nil

The return value is inverse.

Actions

Copy link

#10 [ruby-core:117142]

Updated by shyouhei (Shyouhei Urabe) 8 months ago

nobu (Nobuyoshi Nakada) wrote in #note-9:

shyouhei (Shyouhei Urabe) wrote in #note-8:
  return n == m ? self : nil
The return value is inverse.

My bad. Thank you for correction.

Actions

Copy link

#11 [ruby-core:117177]

Updated by Eregon (Benoit Daloze) 8 months ago

That implementation using size is not thread-safe, even on CRuby AFAIK.
For example, if T2 calls add? with a new element while T1 calls add? with an existing element.
If T1 is just before m = size when T2 executes add(o), then both threads return "element added" but T1 did not add an element (incorrect result).

The original code with two lookups does not have that race condition.
However it can have the race condition that two threads adding the same new element both return "element added".
Hash#exchange_value would fix that.

Actions

Copy link

#12 [ruby-core:117190]

Updated by shyouhei (Shyouhei Urabe) 8 months ago

Yes. add(o) unless include?(o) isn't thread safe already. My implementation just doesn't care to improve that.

Actions

Copy link

#13 [ruby-core:117192]

Updated by AMomchilov (Alexander Momchilov) 8 months ago

shyouhei (Shyouhei Urabe) wrote in #note-8:

Why not:

Because I didn't think of that :)

I would be okay with it, but I think the thread safety issue is also worth solving. The implementation I'm proposing solves both the performance and safety problems.

Actions

Copy link

Also available in: Atom PDF

Like2

Like0Like0Like0Like1Like0Like0Like0Like0Like0Like0Like0Like0Like0

Project

General

Profile

Ruby » Ruby master

Custom queries

Bug #20301

`Set#add?` does two hash look-ups

Theory¶

Benchmark summary¶

Updated by AMomchilov (Alexander Momchilov) 8 months ago

Updated by AMomchilov (Alexander Momchilov) 8 months ago

Updated by Dan0042 (Daniel DeLorme) 8 months ago · Edited

Updated by Eregon (Benoit Daloze) 8 months ago

Updated by Eregon (Benoit Daloze) 8 months ago

Updated by AMomchilov (Alexander Momchilov) 8 months ago

Updated by AMomchilov (Alexander Momchilov) 8 months ago

Updated by shyouhei (Shyouhei Urabe) 8 months ago

Updated by nobu (Nobuyoshi Nakada) 8 months ago

Updated by shyouhei (Shyouhei Urabe) 8 months ago

Updated by Eregon (Benoit Daloze) 8 months ago

Updated by shyouhei (Shyouhei Urabe) 8 months ago

Updated by AMomchilov (Alexander Momchilov) 8 months ago