Project

General

Profile

Actions

Feature #18683

open

Allow to create hashes with a specific capacity.

Added by byroot (Jean Boussier) 3 months ago. Updated 2 months ago.

Status:
Open
Priority:
Normal
Assignee:
-
Target version:
-
[ruby-core:108185]

Description

Various protocol parsers such as Redis RESP3 or msgpack, have to create hashes, and they know the size in advance.
For efficiency, it would be preferable if they could directly allocate a Hash of the necessary size, so that large hashes wouldn't cause many re-alloccations.

Example of code that would benefit:

String and Array both already offer similar APIs:

String.new(capacity: XXX)
Array.new(XX) / rb_ary_new_capa(long)

However there's no such public API for Hashes, neither in Ruby land not in the C extension API.

Proposal

I think Hash.new should accept a capacity: named parameter:

hash = Hash.new(capacity: 1000)

Additionally I think the internal rb_hash_new_with_size function should be exposed to C extensions as rb_hash_new_capa(long), for consistency with rb_ary_new_capa(long).

Updated by jeremyevans0 (Jeremy Evans) 3 months ago

This would break backwards compatibility:

Hash.new(capacity: 1000)[nil]
# => {:capacity=>1000}

Updated by byroot (Jean Boussier) 3 months ago

That's a good point.

I suppose it could be worked around with either:

  • A positional argument: `Hash.new(nil, capacity).
  • An alternative constructor (e.g. Hash.new_with_capcity).
  • And instance setter h = {}; h.capacity = 1024.

Updated by shan (Shannon Skipper) 2 months ago

Another option would be Hash.new { capacity } like to_enum { size }.

Updated by Eregon (Benoit Daloze) 2 months ago

shan (Shannon Skipper) wrote in #note-3:

Another option would be Hash.new { capacity } like to_enum { size }.

That's conflicting with Hash.new { |h,k| h[k] = [] }

Updated by ko1 (Koichi Sasada) 2 months ago

  • +1 to export C-API with good name.
  • idea: Make another API Hash.create with kwargs?

Updated by mame (Yusuke Endoh) 2 months ago

I confirmed the proposed API actually brings performance improvements at least in a micro benchmark.

$ time ./miniruby -e '1000.times { h = {}; 100000.times {|x| h[x] = true } }'

real    0m8.403s
user    0m8.343s
sys     0m0.060s

$ time ./miniruby -e '1000.times { h = Hash.new_with_capacity(100000); 100000.times {|x| h[x] = true } }'

real    0m7.603s
user    0m7.533s
sys     0m0.070s

My preference of its API style is Hash.new(capacity: 100000). Can we first deprecate any keyword arguments for Hash.new and then introduce the capacity keyword?

diff --git a/hash.c b/hash.c
index da85fd35c6..0d0faf6ecc 100644
--- a/hash.c
+++ b/hash.c
@@ -1559,10 +1559,10 @@ copy_compare_by_id(VALUE hash, VALUE basis)
     return hash;
 }

-MJIT_FUNC_EXPORTED VALUE
-rb_hash_new_with_size(st_index_t size)
+static VALUE
+hash_alloc_with_size(VALUE klass, st_index_t size)
 {
-    VALUE ret = rb_hash_new();
+    VALUE ret = hash_alloc(klass);
     if (size == 0) {
         /* do nothing */
     }
@@ -1575,6 +1575,12 @@ rb_hash_new_with_size(st_index_t size)
     return ret;
 }

+MJIT_FUNC_EXPORTED VALUE
+rb_hash_new_with_size(st_index_t size)
+{
+    return hash_alloc_with_size(rb_cHash, size);
+}
+
 static VALUE
 hash_copy(VALUE ret, VALUE hash)
 {
@@ -1904,6 +1910,15 @@ rb_hash_s_create(int argc, VALUE *argv, VALUE klass)
     return hash;
 }

+static VALUE
+rb_hash_s_new_with_capa(VALUE klass, VALUE size)
+{
+    VALUE hash;
+    hash = hash_alloc_with_size(klass, NUM2LONG(size));
+    hash_verify(hash);
+    return hash;
+}
+
 MJIT_FUNC_EXPORTED VALUE
 rb_to_hash_type(VALUE hash)
 {
@@ -7155,6 +7170,7 @@ Init_Hash(void)
     rb_define_alloc_func(rb_cHash, empty_hash_alloc);
     rb_define_singleton_method(rb_cHash, "[]", rb_hash_s_create, -1);
     rb_define_singleton_method(rb_cHash, "try_convert", rb_hash_s_try_convert, 1);
+    rb_define_singleton_method(rb_cHash, "new_with_capacity", rb_hash_s_new_with_capa, 1);
     rb_define_method(rb_cHash, "initialize", rb_hash_initialize, -1);
     rb_define_method(rb_cHash, "initialize_copy", rb_hash_replace, 1);
     rb_define_method(rb_cHash, "rehash", rb_hash_rehash, 0);

Updated by knu (Akinori MUSHA) 2 months ago

Maybe you could add it as the second argument, making the default argument mandatory when specifying a capacity: Hash.new(nil, capa)

Edit: I missed byroot's comment... 🙁

Updated by ko1 (Koichi Sasada) 2 months ago

in dev-meeting,

  • exposing C-API is okay (rb_hash_new_capa())`.
  • Matz: Hash.new(capcacity:) is not acceptable to keep compatibility.
  • Matz: Hash.new(ifnone, capa = 0) is not acceptable.
  • akr: How about to use Array#to_h by using rb_hash_bulk_insert() and it can improve performance.

memo:

rb_hash_bulk_insert() prepares internal array of table and rehash once.
Now Array#to_h doesn't use rb_hash_bulk_insert(), but if we use it, it can be faster than naïve insertion loop.

Updated by byroot (Jean Boussier) 2 months ago

The problem with Array#to_h, is that it require an array of pairs [[k, v], [k, v]]. So what you'll get by right sizing, you'll likely loose via all these extra array allocations.

Hash[*[1, 2, 3, 4]] works but is obviously even slower.

But yes, I wouldn't mind an efficient way to create a Hash out of a flat array.

Updated by Eregon (Benoit Daloze) 2 months ago

I also thought about Array#to_h, but in cases such as deserialization like above there is no Array and the extra allocations/operations would be non-trivial costs.
I think a separate method like Hash.create would be a good way to expose it to Ruby (I don't like setter API, can easily be used wrong).

Updated by Dan0042 (Daniel DeLorme) 2 months ago

mame (Yusuke Endoh) wrote in #note-6:

My preference of its API style is Hash.new(capacity: 100000). Can we first deprecate any keyword arguments for Hash.new and then introduce the capacity keyword?

I approve of this. What was the point of "real keyword arguments" if we can't even start adding keywords to existing methods? It's possible to minimize the incompatibility to the point where it's almost nonexistent:

def Hash.new(*args, **kw)
  if kw.size == 1 and kw[:capacity].is_a?(Integer)
    capacity = kw[:capacity]
  elsif kw.any?
    #maybe warn("using keywords as positional hash") if $VERBOSE
    args << kw
  end
  #allocate...
end

Also, in a search of 1018 gems I only found a single instance of \bHash\.new\((\w+:|:\w+ ?=>)

cucumber-7.1.0/lib/cucumber/multiline_argument/data_table.rb
78:      NULL_CONVERSIONS = Hash.new(strict: false, proc: ->(cell_value) { cell_value }).freeze

Updated by jeremyevans0 (Jeremy Evans) 2 months ago

Dan0042 (Daniel DeLorme) wrote in #note-11:

What was the point of "real keyword arguments" if we can't even start adding keywords to existing methods?

The tradeoff made in Ruby 3 was to be more backwards compatible. Otherwise, we would have had to break all usage of method(key: value) where method accepted an optional positional argument. There are a lot of these methods, especially for code written before Ruby 2 or that did not use keyword arguments due to all of the problems with keyword arguments before Ruby 3.

Going forward, if you want to be able to safely add keyword arguments later, you can use **nil in the method definition to indicate that the method does not currently accept any keywords. This will prohibit keyword to positional argument conversion.

Updated by byroot (Jean Boussier) 2 months ago

I've prepared a patch for the C-API since it's already accepted: https://github.com/ruby/ruby/pull/5835

I've also looked at Array#to_h implementation, but I'm not sure to understand what the bulk_insert suggestion is about.

As for Hash.create I wouldn't mind it, if there's consensus for it I can look at implementing it.

Actions

Also available in: Atom PDF