Feature #18683
closedAllow to create hashes with a specific capacity.
Description
Various protocol parsers such as Redis RESP3
or msgpack
, have to create hashes, and they know the size in advance.
For efficiency, it would be preferable if they could directly allocate a Hash of the necessary size, so that large hashes wouldn't cause many re-alloccations.
Example of code that would benefit:
String
and Array
both already offer similar APIs:
String.new(capacity: XXX)
Array.new(XX) / rb_ary_new_capa(long)
However there's no such public API for Hashes, neither in Ruby land not in the C extension API.
Proposal¶
I think Hash.new
should accept a capacity:
named parameter:
hash = Hash.new(capacity: 1000)
Additionally I think the internal rb_hash_new_with_size
function should be exposed to C extensions as rb_hash_new_capa(long)
, for consistency with rb_ary_new_capa(long)
.
Updated by jeremyevans0 (Jeremy Evans) over 2 years ago
This would break backwards compatibility:
Hash.new(capacity: 1000)[nil]
# => {:capacity=>1000}
Updated by byroot (Jean Boussier) over 2 years ago
That's a good point.
I suppose it could be worked around with either:
- A positional argument: `Hash.new(nil, capacity).
- An alternative constructor (e.g.
Hash.new_with_capcity
). - And instance setter
h = {}; h.capacity = 1024
.
Updated by shan (Shannon Skipper) over 2 years ago
Another option would be Hash.new { capacity }
like to_enum { size }
.
Updated by Eregon (Benoit Daloze) over 2 years ago
shan (Shannon Skipper) wrote in #note-3:
Another option would be
Hash.new { capacity }
liketo_enum { size }
.
That's conflicting with Hash.new { |h,k| h[k] = [] }
Updated by ko1 (Koichi Sasada) over 2 years ago
- +1 to export C-API with good name.
- idea: Make another API
Hash.create
with kwargs?
Updated by mame (Yusuke Endoh) over 2 years ago
I confirmed the proposed API actually brings performance improvements at least in a micro benchmark.
$ time ./miniruby -e '1000.times { h = {}; 100000.times {|x| h[x] = true } }'
real 0m8.403s
user 0m8.343s
sys 0m0.060s
$ time ./miniruby -e '1000.times { h = Hash.new_with_capacity(100000); 100000.times {|x| h[x] = true } }'
real 0m7.603s
user 0m7.533s
sys 0m0.070s
My preference of its API style is Hash.new(capacity: 100000)
. Can we first deprecate any keyword arguments for Hash.new and then introduce the capacity keyword?
diff --git a/hash.c b/hash.c
index da85fd35c6..0d0faf6ecc 100644
--- a/hash.c
+++ b/hash.c
@@ -1559,10 +1559,10 @@ copy_compare_by_id(VALUE hash, VALUE basis)
return hash;
}
-MJIT_FUNC_EXPORTED VALUE
-rb_hash_new_with_size(st_index_t size)
+static VALUE
+hash_alloc_with_size(VALUE klass, st_index_t size)
{
- VALUE ret = rb_hash_new();
+ VALUE ret = hash_alloc(klass);
if (size == 0) {
/* do nothing */
}
@@ -1575,6 +1575,12 @@ rb_hash_new_with_size(st_index_t size)
return ret;
}
+MJIT_FUNC_EXPORTED VALUE
+rb_hash_new_with_size(st_index_t size)
+{
+ return hash_alloc_with_size(rb_cHash, size);
+}
+
static VALUE
hash_copy(VALUE ret, VALUE hash)
{
@@ -1904,6 +1910,15 @@ rb_hash_s_create(int argc, VALUE *argv, VALUE klass)
return hash;
}
+static VALUE
+rb_hash_s_new_with_capa(VALUE klass, VALUE size)
+{
+ VALUE hash;
+ hash = hash_alloc_with_size(klass, NUM2LONG(size));
+ hash_verify(hash);
+ return hash;
+}
+
MJIT_FUNC_EXPORTED VALUE
rb_to_hash_type(VALUE hash)
{
@@ -7155,6 +7170,7 @@ Init_Hash(void)
rb_define_alloc_func(rb_cHash, empty_hash_alloc);
rb_define_singleton_method(rb_cHash, "[]", rb_hash_s_create, -1);
rb_define_singleton_method(rb_cHash, "try_convert", rb_hash_s_try_convert, 1);
+ rb_define_singleton_method(rb_cHash, "new_with_capacity", rb_hash_s_new_with_capa, 1);
rb_define_method(rb_cHash, "initialize", rb_hash_initialize, -1);
rb_define_method(rb_cHash, "initialize_copy", rb_hash_replace, 1);
rb_define_method(rb_cHash, "rehash", rb_hash_rehash, 0);
Updated by knu (Akinori MUSHA) over 2 years ago
Maybe you could add it as the second argument, making the default argument mandatory when specifying a capacity: Hash.new(nil, capa)
Edit: I missed byroot's comment... 🙁
Updated by ko1 (Koichi Sasada) over 2 years ago
in dev-meeting,
- exposing C-API is okay (
rb_hash_new_capa()
)`. - Matz:
Hash.new(capcacity:)
is not acceptable to keep compatibility. - Matz:
Hash.new(ifnone, capa = 0)
is not acceptable. - akr: How about to use
Array#to_h
by usingrb_hash_bulk_insert()
and it can improve performance.
memo:
rb_hash_bulk_insert()
prepares internal array of table and rehash once.
Now Array#to_h
doesn't use rb_hash_bulk_insert()
, but if we use it, it can be faster than naïve insertion loop.
Updated by byroot (Jean Boussier) over 2 years ago
The problem with Array#to_h
, is that it require an array of pairs [[k, v], [k, v]]
. So what you'll get by right sizing, you'll likely loose via all these extra array allocations.
Hash[*[1, 2, 3, 4]]
works but is obviously even slower.
But yes, I wouldn't mind an efficient way to create a Hash out of a flat array.
Updated by Eregon (Benoit Daloze) over 2 years ago
I also thought about Array#to_h
, but in cases such as deserialization like above there is no Array and the extra allocations/operations would be non-trivial costs.
I think a separate method like Hash.create
would be a good way to expose it to Ruby (I don't like setter API, can easily be used wrong).
Updated by Dan0042 (Daniel DeLorme) over 2 years ago
mame (Yusuke Endoh) wrote in #note-6:
My preference of its API style is
Hash.new(capacity: 100000)
. Can we first deprecate any keyword arguments for Hash.new and then introduce the capacity keyword?
I approve of this. What was the point of "real keyword arguments" if we can't even start adding keywords to existing methods? It's possible to minimize the incompatibility to the point where it's almost nonexistent:
def Hash.new(*args, **kw)
if kw.size == 1 and kw[:capacity].is_a?(Integer)
capacity = kw[:capacity]
elsif kw.any?
#maybe warn("using keywords as positional hash") if $VERBOSE
args << kw
end
#allocate...
end
Also, in a search of 1018 gems I only found a single instance of \bHash\.new\((\w+:|:\w+ ?=>)
cucumber-7.1.0/lib/cucumber/multiline_argument/data_table.rb
78: NULL_CONVERSIONS = Hash.new(strict: false, proc: ->(cell_value) { cell_value }).freeze
Updated by jeremyevans0 (Jeremy Evans) over 2 years ago
Dan0042 (Daniel DeLorme) wrote in #note-11:
What was the point of "real keyword arguments" if we can't even start adding keywords to existing methods?
The tradeoff made in Ruby 3 was to be more backwards compatible. Otherwise, we would have had to break all usage of method(key: value)
where method
accepted an optional positional argument. There are a lot of these methods, especially for code written before Ruby 2 or that did not use keyword arguments due to all of the problems with keyword arguments before Ruby 3.
Going forward, if you want to be able to safely add keyword arguments later, you can use **nil
in the method definition to indicate that the method does not currently accept any keywords. This will prohibit keyword to positional argument conversion.
Updated by byroot (Jean Boussier) over 2 years ago
I've prepared a patch for the C-API since it's already accepted: https://github.com/ruby/ruby/pull/5835
I've also looked at Array#to_h
implementation, but I'm not sure to understand what the bulk_insert
suggestion is about.
As for Hash.create
I wouldn't mind it, if there's consensus for it I can look at implementing it.
Updated by hsbt (Hiroshi SHIBATA) about 2 years ago
- Status changed from Open to Closed
https://github.com/ruby/ruby/pull/5835 has been merged. I'll close this.
@byroot (Jean Boussier) If you have any concern or proposal, please re-open or file another issue. Thanks.
Updated by byroot (Jean Boussier) about 2 years ago
Yeah the ruby side of the proposal wasn't merged, but I'll open an issue specifically for it.
Updated by byroot (Jean Boussier) about 2 years ago
- Related to Feature #19236: Allow to create hashes with a specific capacity from Ruby added