Feature #17342

Hash#fetch_set

Added by MaxLap (Maxime Lapointe) about 2 months ago. Updated about 1 month ago.

Status:
Feedback
Priority:
Normal
Assignee:
-
Target version:
-
[ruby-core:101232]

Description

I would like to propose adding the fetch_set method to Hash. It behaves just like fetch, but when using the default value (2nd argument or the block), it also sets the value in the Hash for the given key.

We often use the pattern cache[key] ||= calculation. However, this pattern has a problem when the calculation could return false or nil: in those cases, the calculation is repeated every time.
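
As a minimal illustration of the pitfall (variable names hypothetical): a nil result is never "seen" by ||=, while the fetch idiom caches it:

```ruby
queries = []                               # records each run of the "expensive" computation
compute = ->(key) { queries << key; nil }  # a legitimate nil result (e.g. record not found)

with_oreq = {}
3.times { with_oreq[:a] ||= compute.call(:a) }
# queries is now [:a, :a, :a] -- ||= repeated the work every time

with_fetch = {}
3.times { with_fetch.fetch(:b) { |k| with_fetch[k] = compute.call(k) } }
# queries gained a single :b -- the stored nil short-circuits later calls
```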

I believe the best practice in that case is:

cache.fetch(key) { cache[key] = calculation }

With my suggestion, it would be:

cache.fetch_set(key) { calculation }

In these examples, each part is very short, so the fetch version is still clean. But as each part gets longer, having to repeat cache[key] adds more friction.

Here is a more realistic example:

# Also using the key argument to the block to avoid repeating the
# long symbol, adding some indirection
RequestStore.store.fetch(:monitor_value_is_delayed?) do |key|
  RequestStore.store[key] = !MonitorValue.where('date >= ?', Time.now - 5.minutes).exists?
end

RequestStore.store.fetch_set(:monitor_value_is_delayed?) do
  !MonitorValue.where('date >= ?', Time.now - 5.minutes).exists?
end

There is a precedent for such a method: Python has it, though with a quite confusing name: setdefault(key, default_value). Despite what the name suggests, it does not set a default for the whole dictionary; it does exactly what is proposed here. https://docs.python.org/3/library/stdtypes.html#dict.setdefault

Updated by chrisseaton (Chris Seaton) about 2 months ago

Thanks, I've always wanted this feature. Whenever I write cache.fetch(key) { cache[key] = calculation } (or, more often, cache.fetch(key) { |k| cache[k] = calculation }) I think 'surely this pattern is worth making simpler.'

Updated by duerst (Martin Dürst) about 2 months ago

I think the feature in general is okay, but I have two concerns:

1) The name very easily suggests that the method is fetching a set, rather than fetching and setting at the same time.

2) Why do we need a block for the second parameter? Can't that just be an ordinary parameter?

Updated by chrisseaton (Chris Seaton) about 2 months ago

#fetch_or_set could be a good name.

Why do we need a block for the second parameter? Can't that just be an ordinary parameter?

Because we don't want to do the work of calculating the initial value if it isn't needed because the value is already set.

Updated by phluid61 (Matthew Kerwin) about 2 months ago

chrisseaton (Chris Seaton) wrote in #note-3:

#fetch_or_set could be a good name.

Why do we need a block for the second parameter? Can't that just be an ordinary parameter?

Because we don't want to do the work of calculating the initial value if it isn't needed because the value is already set.

I can see the utility; it would be good if it had a similar signature to #fetch:

hsh = {}
hsh.fetch_or_set(:a, 1) # => 1, hsh = {:a => 1}
hsh.fetch_or_set(:b) {|key| key.to_s } # => "b", hsh = {:a => 1, :b => "b"}

The only reason I don't suggest it as a keyword argument to #fetch is the edge case of including the kwarg without a default value/block; otherwise hsh.fetch(:a, 1, store_default: true) would be okay.

Updated by MaxLap (Maxime Lapointe) about 2 months ago

I didn't include an example using 2 parameters instead of a block, but yes, that option is also available; it would have the same signature as a normal fetch. In the examples, calculation was meant to be something complex, possibly a method call, which we want to avoid doing if the key is already in the Hash.
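
For illustration, a pure-Ruby sketch of what that two-argument form could look like; this is only a stand-in to show the call shapes, not the proposed implementation:

```ruby
class Hash
  # Sketch only: mirrors #fetch's call shapes, but the resolved default
  # is also stored under the key. (Unlike the real #fetch, a missing
  # default raises IndexError here rather than KeyError.)
  def fetch_set(key, *default, &block)
    return fetch(key) if key?(key)
    store(key, block ? block.call(key) : default.fetch(0))
  end
end

cache = {}
cache.fetch_set(:retries, 3)         # => 3; cache is now {:retries=>3}
cache.fetch_set(:retries, 5)         # => 3; key already present, 5 is ignored
cache.fetch_set(:tag) { |k| k.to_s } # => "tag"; block form, called only once
```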

I understand the idea that it could read as fetching a set, but I don't see what set that would be, so it doesn't feel ambiguous to me. I did consider the fetch_or_set alternative, but core methods usually prefer shorter names when the general idea is clear enough. Still, it's not a bad name, and I would be happy to have the feature under it.

Updated by nobu (Nobuyoshi Nakada) about 2 months ago

XXX_or_YYY doesn't feel like a good name in general.
Why does it need to be a method of Hash, and built-in?

Why not a separate class?

class Cache < Hash
  def fetch(key)
    super(key) { self[key] = yield(key) }
  end
end

Updated by jbeschi (jacopo beschi) about 2 months ago

fetch_set mixes the concept of query with the concept of command and I think it's not a good approach.

Moreover I don't really see any big advantage in writing

cache.fetch_set(key) { calculation }

vs

cache.fetch(key) { cache[key] = calculation }

Therefore I don't understand why we should add it to the language.

Updated by Eregon (Benoit Daloze) about 2 months ago

Another name for this is compute_if_absent.
It's notably the name used in concurrent-ruby:
http://ruby-concurrency.github.io/concurrent-ruby/1.1.5/Concurrent/Map.html#compute_if_absent-instance_method

And in Java, where it's even defined on Map (not just ConcurrentMap):
https://docs.oracle.com/javase/8/docs/api/java/util/Map.html#computeIfAbsent-K-java.util.function.Function-

Having it as a built-in method also makes it possible to avoid computing the key's #hash twice, and potentially to avoid redoing the lookup in the Hash.

Another way to do this pattern is to use the default block:

RequestStore.store = Hash.new do |h, k|
  h[k] = !MonitorValue.where('date >= ?', Time.now - 5.minutes).exists?
end

RequestStore.store[:monitor_value_is_delayed?]

Which already works fine.
And it has the advantage that if multiple places want to read from the Hash they don't have to repeat the code.
Is there a case this pattern wouldn't work and where Hash#fetch_set would work?

This pattern can be made to work with parallelism too, see Idiomatic Concurrent Hash Operations, page 83.

Regarding concurrency and parallelism, we need to define the semantics if we add this method.

Of course, the assignment should not be performed if there is already a key, it must be "put if absent" semantics
(cache.fetch(key) { cache[key] = calculation } is actually breaking that).

The question is whether the given block can be executed multiple times for a given key.
If not, it requires synchronization while calling the block, which can lead to deadlocks.
If yes, it doesn't require synchronization while calling the block which seems safer, but it means the block can be called multiple times.

Updated by MaxLap (Maxime Lapointe) about 2 months ago

I forgot to mention that this pattern can also be more performant, as the key only needs to be hashed once for both the fetching and the setting. It's minor, but I think of it every time I write the fetch pattern, and when the key is complex, this could be meaningful. (Dang, Eregon beat me to this by 2 minutes!)


nobu (Nobuyoshi Nakada) wrote in #note-6:

Why not a separate class?

class Cache < Hash
  def fetch(key)
    super(key) { self[key] = yield(key) }
  end
end

The Hash that we are using is not always under our control. In my second example, I'm using the request_store gem, which exposes a Hash (RequestStore.store) that serves as the cache. It would be quite dirty to monkey patch this. I could put a custom Hash in the store, but that would lose the cleaner and shorter code benefits that I'm trying to achieve with this feature request.

Such a class is also less performant, not more. (This is a very minor point.)

Also: if I see cache.fetch(key) { calculation } somewhere, I will instantly be worried:

  • Did the person forget to set the key?
  • Is it only set elsewhere? => Oh wait, it's a custom class that does something different.
  • Will one of my colleagues copy this pattern without remembering that this is a different class?

So if I were to make a different class, I would still use a different method name, just to avoid the friction it would cause.


jbeschi (jacopo beschi) wrote in #note-7:

fetch_set mixes the concept of query with the concept of command and I think it's not a good approach.

Maybe the name suggests so, but not in spirit. All this does is set a value if one isn't already there and, in common Ruby spirit, return something useful: the value at the key.

This is where the Python name comes from: setdefault sets a value "by default" for a single key, and is often used in place of cache[key] ||= [], since Python doesn't support that syntax.


Eregon (Benoit Daloze) wrote in #note-8:

Another name for this is compute_if_absent.

True, though my thinking is that this could also be used with a 2nd argument, like fetch, in which case there is no "computing" to speak of.

Another way to do this pattern is to use the default block:

RequestStore.store = Hash.new do |h, k|
  h[k] = !MonitorValue.where('date >= ?', Time.now - 5.minutes).exists?
end

RequestStore.store[:monitor_value_is_delayed?]

Which already works fine.
And it has the advantage that if multiple places want to read from the Hash they don't have to repeat the code.
Is there a case this pattern wouldn't work and where Hash#fetch_set would work?

RequestStore is used for lots of different things. Setting a default like that means that if I use it for RequestStore.store[:last_count] and that wasn't set, I would instead be running this MonitorValue check. And I can only use your pattern once per Hash.

This pattern can be made to work with parallelism too, see Idiomatic Concurrent Hash Operations, page 83.

Regarding concurrency and parallelism, we need to define the semantics if we add this method.

Of course, the assignment should not be performed if there is already a key, it must be "put if absent" semantics
(cache.fetch(key) { cache[key] = calculation } is actually breaking that).

I don't understand what you mean; how is it breaking that? Do you mean the case of threading, where there could be 2 assignments if 2 threads enter at the same time? I don't think it's the job of the Hash to deal with this.

The question is whether the given block can be executed multiple times for a given key.
If not, it requires synchronization while calling the block, which can lead to deadlocks.
If yes, it doesn't require synchronization while calling the block which seems safer, but it means the block can be called multiple times.

I don't consider Hash to be a concurrency primitive in Ruby, so I wouldn't put any synchronization here. If synchronization is needed, it can be done from inside the block (or the block of the classic fetch pattern).
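
For example (constant and helper names illustrative), a caller that needs at-most-once computation can do the locking inside the block itself, without Hash providing any synchronization:

```ruby
require 'monitor'

CACHE = {}
LOCK  = Monitor.new

# Hypothetical stand-in for an expensive calculation
def slow_lookup
  :result
end

def cached_value
  CACHE.fetch(:key) do
    LOCK.synchronize do
      # double-check inside the lock: another thread may have stored it first
      CACHE.key?(:key) ? CACHE[:key] : (CACHE[:key] = slow_lookup)
    end
  end
end

4.times.map { Thread.new { cached_value } }.map(&:value) # => [:result, :result, :result, :result]
```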

Updated by Eregon (Benoit Daloze) about 2 months ago

MaxLap (Maxime Lapointe) wrote in #note-9:

The Hash that we are using is not always under our control. In my second example, I'm using the request_store gem, which exposes a Hash (RequestStore.store) that serves as the cache.

I see, https://github.com/steveklabnik/request_store; so essentially you might want a different block per key, and a single default block isn't enough.

It would be quite dirty to monkey patch this.

There is Hash#default_proc= but it would be dirty in this case indeed.

I don't understand what you mean, how is it breaking that? You mean in the case of threading, where there could be 2 assignments if 2 threads go in at the same time? I don't think it's the job of the Hash to deal with this.

Yes, exactly.
Many gems assume Hash is thread-safe, so in my opinion it needs to be thread-safe, and it is, at least in CRuby.
Even without threads you need to consider cases like the block itself setting the key, in which case you shouldn't set it again.

I don't consider Hash to be a concurrency primitive in Ruby, so I wouldn't put any synchronization here. If synchronization is needed, it can be done from inside the block (or the block of the classic fetch pattern).

This method is a very clear fit for caches, and caches are typically used from multiple threads, so I think we should consider the concurrency semantics.
Concurrency can also arise for a Hash when mutating while iterating, or when some methods are called recursively, so it needs to be considered even without threads.
Agreed that no synchronization while calling the block is better; the block can do its own synchronization to avoid running twice per key if really needed.

On CRuby, the GIL is enough to check whether the key is already set just before setting it from the result of the block, but the code needs to be careful that there is no Ruby call between that check and the assignment, or it would break with multiple threads.
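
Sketched in pure Ruby (illustration only, not the proposed implementation), those "put if absent" semantics amount to: compute without holding a lock, then re-check the key just before storing, so an existing value, even one stored by the block itself, is never overwritten:

```ruby
class Hash
  # Illustration of "put if absent": the block may run more than once
  # under races, but a value that is already present is never
  # overwritten, even when the block itself stored it.
  def fetch_set(key)
    return fetch(key) if key?(key)
    value = yield(key)                 # runs without any lock held
    key?(key) ? fetch(key) : store(key, value)
  end
end

h = {}
h.fetch_set(:a) { 1 }  # => 1; computed and stored
h.fetch_set(:a) { 2 }  # => 1; already present, block not called
h.fetch_set(:b) { h[:b] = :from_block; :ignored } # => :from_block; the block's own store wins
```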

Updated by nobu (Nobuyoshi Nakada) about 2 months ago

Eregon (Benoit Daloze) wrote in #note-8:

Having it as a built-in method, it also makes it possible to avoid computing the key's #hash twice, and potentially avoid redoing the lookup in the Hash.

That is ideally true, but no one can guarantee that the hash value never changes.


MaxLap (Maxime Lapointe) wrote in #note-9:

The Hash that we are using is not always under our control. In my second example, I'm using the request_store gem, which exposes a Hash (RequestStore.store) that serves as the cache. It would be quite dirty to monkey patch this. I could put a custom Hash in the store, but that would lose the cleaner and shorter code benefits that I'm trying to achieve with this feature request.

module Caching
  refine Hash do
    def cache(key)
      fetch(key) {self[key] = yield(key)}
    end
  end
end

using Caching

RequestStore.store.cache(:monitor_value_is_delayed?) do
  !MonitorValue.where('date >= ?', Time.now - 5.minutes).exists?
end

Updated by marcandre (Marc-Andre Lafortune) about 2 months ago

On Thu, Nov 26, 2020 at 9:05 PM nobu@ruby-lang.org wrote:

That is ideally true, but no one can guarantee that the hash value never changes.

Please, let's be serious. hash[obj] ||= something_that_changes_obj is nonsense code.

Updated by shyouhei (Shyouhei Urabe) about 2 months ago

+1 for the feature. I have had chances to write what is proposed here several times.

marcandre (Marc-Andre Lafortune) wrote in #note-12:

On Thu, Nov 26, 2020 at 9:05 PM nobu@ruby-lang.org wrote:

That is ideally true, but no one can guarantee that the hash value never changes.

Please, let's be serious. hash[obj] ||= something_that_changes_obj is nonsense code.

Yes, I’m quite sure he is dead serious. There are idiots who write nonsense code. As a language, Ruby has to be fool-proof.

Updated by marcandre (Marc-Andre Lafortune) about 1 month ago

I think this is a good feature. I like the fact that this cannot be implemented as efficiently in pure Ruby (i.e. with the hash calculated once).

An alternative name is fetch_init, as it may give a better idea that it is meant to initialize the value (once) for the key. That being said, fetch_set is a good name too.

Updated by Eregon (Benoit Daloze) about 1 month ago

shyouhei (Shyouhei Urabe) wrote in #note-13:

Yes, I’m quite sure he is dead serious. There are idiots who write nonsense code. As a language, Ruby has to be fool-proof.

I think such semantics (calling #hash only once) could be part of the method documentation.
OTOH calling #hash twice is probably not very expensive, and for some cases (e.g. Integer, String) can be optimized by a JIT or by memoizing the hash on the instance.

I don't like the name fetch_set much (does it fetch a Set?).
I think fetch_or_set or cache are better.

Updated by matz (Yukihiro Matsumoto) about 1 month ago

  • Status changed from Open to Feedback

In most cases, hash[:key] ||= init works. The exception is when the init value is false, but that should be rare.
So could you explain the concrete use-case for fetch_set()?

Besides that I am not fully satisfied with the name fetch_set().

Matz.

Updated by marcandre (Marc-Andre Lafortune) about 1 month ago

matz (Yukihiro Matsumoto) wrote in #note-16:

In most cases, hash[:key] ||= init works. The exception is when the init value is false, but that should be rare.

We could look into public codebases for better numbers; it is not the most common case, but it is definitely encountered with some frequency (more with nil than false).

So could you explain the concrete use-case for fetch_set()?

One additional use-case would be a thread-safe equivalent to Ractor.current[:my_key] ||= init.

Besides that I am not fully satisfied with the name fetch_set().

Did you consider fetch_init?

Updated by Dan0042 (Daniel DeLorme) about 1 month ago

marcandre (Marc-Andre Lafortune) wrote in #note-17:

One additional use-case would be a thread-safe equivalent to Ractor.current[:my_key] ||= init.

+1

Updated by MaxLap (Maxime Lapointe) about 1 month ago

matz (Yukihiro Matsumoto) wrote in #note-16:

In most cases, hash[:key] ||= init works. The exception is when the init value is false, but that should be rare.

The problem is not only with false, but with nil too.

Anytime you want to cache something and nil (or false) is a possible value (often meaning the data is missing), this problem arises: those cases are recalculated each time, only to indicate again that the data is missing.

# If there is no config with that internal_id, every use of the
# cached version, which is meant to be fast, actually queries the database
def self.cached_by_internal_id(internal_id)
  my_cache[internal_id] ||= Config.where(internal_id: internal_id).first
end

It is very easy to forget that nil values mean the cache will not act as a cache. This probably happens in many places; it is not critical, it just makes things a little slower by redoing work many times.
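
To make the bug above concrete, here is a self-contained sketch; DB and QUERIES are hypothetical stand-ins for the database and a query counter:

```ruby
module ConfigCache
  DB      = { 1 => :config_one }  # hypothetical stand-in for the configs table
  QUERIES = []                    # records each simulated database query

  def self.my_cache
    @my_cache ||= {}
  end

  # The ||= version: a missing config (nil) is queried again on every call
  def self.broken_lookup(internal_id)
    my_cache[internal_id] ||= begin
      QUERIES << internal_id
      DB[internal_id]             # stands in for Config.where(...).first
    end
  end

  # The fetch idiom: the nil is stored, so each key is queried at most once
  def self.cached_lookup(internal_id)
    my_cache.fetch(internal_id) do |key|
      QUERIES << key
      my_cache[key] = DB[key]
    end
  end
end

2.times { ConfigCache.broken_lookup(42) }  # 42 is missing: queried twice
2.times { ConfigCache.cached_lookup(43) }  # 43 is missing: queried once
ConfigCache::QUERIES # => [42, 42, 43]
```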


As for the name, there are other options that were suggested:

  • fetch_init
  • cache

Updated by p8 (Petrik de Heus) about 1 month ago

Elixir has put_new and put_new_lazy:
https://hexdocs.pm/elixir/Map.html#put_new_lazy/3

Maybe store_if_new or store_new ?

cache.store_new(key) { calculation }

Updated by austin (Austin Ziegler) about 1 month ago

My emails appear not to be making it through the mailing list gateway into the tickets, so…

#note-15

I’m mostly doing Elixir these days, and one of the caching modules I use has as its main interface fetch_or_set, so I think that’s the best choice if this is added.

On the other hand, Nobu’s refinement implementation (#note-11) looks pretty good (and I say this as someone not yet using refinements, even though they’ve been around for a while).

#note-20

put_new doesn’t work like this in Elixir, but it is similar in some ways (it puts only if the key doesn’t exist).

The correct Ruby implementation for this would be something like

hash.key?(key) || hash[key] = value

The advantage of a fetch_or_set method is that it can be implemented atomically for multithreading runtimes.

Updated by shyouhei (Shyouhei Urabe) about 1 month ago

MaxLap (Maxime Lapointe) wrote in #note-19:

matz (Yukihiro Matsumoto) wrote in #note-16:

In most cases, hash[:key] ||= init works. The exception is when the init value is false, but that should be rare.

The problem is not only with false, but with nil too.

Anytime you want to cache something and nil (or false) is a possible value (often meaning the data is missing), this problem arises,

Matz says it's rare.

I'm not sure if that is true, but so far nobody has shown any evidence of its rarity.
