Project

General

Profile

Bug #16121

Stop making a redundant hash copy in Hash#dup

Added by dylants (Dylan Thacker-Smith) 3 months ago. Updated 24 days ago.

Status:
Closed
Priority:
Normal
Target version:
-
ruby -v:
ruby 2.7.0dev (2019-08-23T16:41:09Z master b38ab0a3a9) [x86_64-darwin18]
[ruby-core:94507]

Description

Problem

I noticed while profiling object allocations that Hash#dup was allocating 2 objects instead of only 1 as expected. I looked for alternatives for comparison and found that Hash[hash] created a copy with only a single object allocation and seemed to be more than twice as fast. Reading the source code revealed the difference was that Hash#dup creates a copy of the Hash, then rehashes the copy. However, rehashing is done by making a copy of the hash, so the first copy before rehashing was unnecessary.

Solution

I changed the code to just use rehashing to make the copy of the hash to improve performance while also preserving the existing behaviour.

Benchmark

require 'benchmark'

N = 100000

def report(x, name)
  x.report(name) do
    N.times do
      yield
    end
  end
end

hashes = {
  small_hash: { a: 1 },
  larger_hash: 20.times.map { |i| [('a'.ord + i).chr.to_sym, i] }.to_h
}

Benchmark.bmbm do |x|
  hashes.each do |name, hash|
    report(x, "#{name}.dup") do
      hash.dup
    end
  end
end

results on master

                      user     system      total        real
small_hash.dup    0.401350   0.001638   0.402988 (  0.404608)
larger_hash.dup   7.218548   0.433616   7.652164 (  7.695990)

results with the attached patch

                      user     system      total        real
small_hash.dup    0.336733   0.002425   0.339158 (  0.341760)
larger_hash.dup   6.617343   0.398407   7.015750 (  7.070282)

Files

0001-Remove-redundant-Check_Type-after-to_hash.diff.txt (624 Bytes) 0001-Remove-redundant-Check_Type-after-to_hash.diff.txt [PATCH 1/4] Remove redundant Check_Type after to_hash dylants (Dylan Thacker-Smith), 08/23/2019 07:55 PM
0002-Fix-freeing-and-clearing-destination-hash-in-Hash.diff.txt (1.57 KB) 0002-Fix-freeing-and-clearing-destination-hash-in-Hash.diff.txt [PATCH 2/4] Fix freeing and clearing destination hash in Hash#initialize_copy dylants (Dylan Thacker-Smith), 08/23/2019 07:55 PM
0003-Remove-dead-code-paths-in-rb_hash_initialize_copy.diff.txt (1.12 KB) 0003-Remove-dead-code-paths-in-rb_hash_initialize_copy.diff.txt [PATCH 3/4] Remove dead code paths in rb_hash_initialize_copy dylants (Dylan Thacker-Smith), 08/23/2019 07:55 PM
0004-Stop-making-a-redundant-hash-copy-in-Hash-dup.diff.txt (1.35 KB) 0004-Stop-making-a-redundant-hash-copy-in-Hash-dup.diff.txt [PATCH 4/4] Stop making a redundant hash copy in Hash#dup dylants (Dylan Thacker-Smith), 08/23/2019 07:55 PM

Associated revisions

Revision b9702590
Added by dylants (Dylan Thacker-Smith) 24 days ago

Stop making a redundant hash copy in Hash#dup (#2489)

  • Stop making a redundant hash copy in Hash#dup

It was making a copy of the hash without rehashing, then created an
extra copy of the hash to do the rehashing. Since rehashing creates
a new copy already, this change just uses that rehashing to make
the copy.

[Bug #16121]

  • Remove redundant Check_Type after to_hash

  • Fix freeing and clearing destination hash in Hash#initialize_copy

The code was assuming the state of the destination hash based on the
source hash for clearing any existing table on it. If these don't match,
then that can cause the old table to be leaked. This can be seen by
compiling hash.c with #define HASH_DEBUG 1 and running the following
script, which will crash from a debug assertion.

h = 9.times.map { |i| [i, i] }.to_h
h.send(:initialize_copy, {})
  • Remove dead code paths in rb_hash_initialize_copy

Given that RHASH_ST_TABLE_P(h) is defined as (!RHASH_AR_TABLE_P(h))
it shouldn't be possible for a hash to be neither of these, so there
is no need for the removed else if blocks.

  • Share implementation between Hash#replace and Hash#initialize_copy

This also fixes key rehashing for small hashes backed by an array
table for Hash#replace. This used to be done consistently in ruby
2.5.x, but stopped being done for small arrays in ruby 2.6.x.

This also bring optimization improvements that were done for
Hash#initialize_copy to Hash#replace.

  • Add the Hash#dup benchmark

Revision 5d63a9da
Added by nobu (Nobuyoshi Nakada) 24 days ago

[Bug #16121] adjusted indent [ci skip]

History

Updated by dylants (Dylan Thacker-Smith) 3 months ago

I split up the patch into the 4 attached patches, because I realized there were a few changes that could be emphasized independently, each with their own description of the change. Hopefully that will make the optimization itself in the last patch easier to review.

Technically that makes this a bug fix, since "Patch 2/4" does fix a memory leak in the following contrived code

h = 9.times.map { |i| [i, i] }.to_h
h.send(:initialize_copy, {})

where the st_table for the hash h won't get freed with st_free_table because it doesn't try to call that on the if (RHASH_AR_TABLE_P(hash2)) { conditional block in rb_hash_initialize_copy. Mostly I didn't want to ignore it and turn into undefined behaviour from the optimization.

#2

Updated by dylants (Dylan Thacker-Smith) 3 months ago

  • File deleted (0001-Stop-making-a-redundant-hash-copy-in-Hash-dup.diff.txt)

Updated by ko1 (Koichi Sasada) about 2 months ago

  • Assignee set to ko1 (Koichi Sasada)

I'll review it.

Updated by ko1 (Koichi Sasada) about 2 months ago

Thank you. Completely fine.
Do you want to merge via github PR or attached patch by my commit?

Updated by ko1 (Koichi Sasada) about 2 months ago

I found that

rb_define_method(rb_cArray, "initialize_copy", rb_ary_replace, 1);

Array#initialize_copy == Array#replace. Can we use same technique?

Updated by dylants (Dylan Thacker-Smith) about 2 months ago

ko1 (Koichi Sasada) wrote:

I found that

rb_define_method(rb_cArray, "initialize_copy", rb_ary_replace, 1);

Array#initialize_copy == Array#replace. Can we use same technique?

I tried doing that and it caused test failures because Hash#replace stop rehashing keys in ruby 2.6. I've opened a bug for that: https://bugs.ruby-lang.org/issues/16187

I've opened https://github.com/ruby/ruby/pull/2489 which replaces the implementation of both of these methods with a single optimized implementation that always rehashes.

I benchmarked it with the following benchmark/hash_dup.yml file

prelude: |
  small_hash = { a: 1 }
  larger_hash = 20.times.map { |i| [('a'.ord + i).chr.to_sym, i] }.to_h

benchmark:
  dup_small: small_hash.dup
  dup_larger: larger_hash.dup
  replace_small: '{}.replace(small_hash)'
  replace_larger: '{}.replace(larger_hash)'
loop_count: 100000

and here are the results against the latest ruby master commit

Comparison:
                        dup_small
          built-ruby:   4105090.4 i/s
        compare-ruby:   3089471.1 i/s - 1.33x  slower

                       dup_larger
          built-ruby:    682873.5 i/s
        compare-ruby:    518623.8 i/s - 1.32x  slower

                    replace_small
        compare-ruby:  10697475.5 i/s
          built-ruby:   9399379.9 i/s - 1.14x  slower

                   replace_larger
          built-ruby:    807767.5 i/s
        compare-ruby:    358383.1 i/s - 2.25x  slower

The replace_small case is slower due to rehashing, but is still much faster than on ruby version 2.5.6 that did rehash keys of small hashes:

                    replace_small
          built-ruby:   8847991.6 i/s
        compare-ruby:   2178174.7 i/s - 4.06x  slower

Updated by ko1 (Koichi Sasada) 24 days ago

Thank you, I merged it!

#8

Updated by dylants (Dylan Thacker-Smith) 24 days ago

  • Status changed from Open to Closed

Applied in changeset git|b9702590445dfea62d271d0a5c942b7adfaaacdd.


Stop making a redundant hash copy in Hash#dup (#2489)

  • Stop making a redundant hash copy in Hash#dup

It was making a copy of the hash without rehashing, then created an
extra copy of the hash to do the rehashing. Since rehashing creates
a new copy already, this change just uses that rehashing to make
the copy.

[Bug #16121]

  • Remove redundant Check_Type after to_hash

  • Fix freeing and clearing destination hash in Hash#initialize_copy

The code was assuming the state of the destination hash based on the
source hash for clearing any existing table on it. If these don't match,
then that can cause the old table to be leaked. This can be seen by
compiling hash.c with #define HASH_DEBUG 1 and running the following
script, which will crash from a debug assertion.

h = 9.times.map { |i| [i, i] }.to_h
h.send(:initialize_copy, {})
  • Remove dead code paths in rb_hash_initialize_copy

Given that RHASH_ST_TABLE_P(h) is defined as (!RHASH_AR_TABLE_P(h))
it shouldn't be possible for a hash to be neither of these, so there
is no need for the removed else if blocks.

  • Share implementation between Hash#replace and Hash#initialize_copy

This also fixes key rehashing for small hashes backed by an array
table for Hash#replace. This used to be done consistently in ruby
2.5.x, but stopped being done for small arrays in ruby 2.6.x.

This also bring optimization improvements that were done for
Hash#initialize_copy to Hash#replace.

  • Add the Hash#dup benchmark

Also available in: Atom PDF