Project

General

Profile

Actions

Feature #9425

closed

[PATCH] st: use power-of-two sizes to avoid slow modulo ops

Added by normalperson (Eric Wong) almost 11 years ago. Updated almost 11 years ago.

Status:
Closed
Assignee:
-
Target version:
[ruby-core:59836]

Description

Prime number-sized hash tables are only needed to compensate for bad
hash functions. Ruby has good hash functions nowadays, so reduce our
code size with power-of-two-sized hash tables which allows us to avoid
the slow modulo operation. I expected numhash performance to be worse,
but it seems most of those hashes are either too-small-to-matter or
well-distributed anyways. If we find problems with some existing
numhashes we should start using a proper hash function (Murmur via
st_hash_uint) on those.

This consistently speeds up the bm_hash_flatten and bm_vm2_bighash.

target 0: trunk (ruby 2.2.0dev (2014-01-17 trunk 44631) [x86_64-linux]) at "/home/ew/rrrr/i/bin/ruby --disable=gems"
target 1: st-noprime (ruby 2.2.0dev (2014-01-17 trunk 44631) [x86_64-linux]) at "/home/ew/ruby/i/bin/ruby --disable=gems"

Benchmarks on a Xeon E3-1230 v3 CPU:

minimum results in each 10 measurements.
Execution time (sec)
name	trunk	st-noprime
hash_flatten	0.500	0.345
hash_keys	0.191	0.192
hash_shift	0.019	0.018
hash_values	0.201	0.200
loop_whileloop2	0.090	0.090
vm2_bighash*	4.457	3.578

Speedup ratio: compare with the result of `trunk' (greater is better)
name	st-noprime
hash_flatten	1.451
hash_keys	0.998
hash_shift	1.046
hash_values	1.003
loop_whileloop2	1.000
vm2_bighash*	1.246

Somewhat less impressive on an AMD FX 8320:

minimum results in each 10 measurements.
Execution time (sec)
name    trunk   st-noprime
hash_flatten    0.633   0.596
hash_keys       0.236   0.232
hash_shift      0.031   0.032
hash_values     0.234   0.238
loop_whileloop2 0.135   0.135
vm2_bighash*    8.198   6.982

Speedup ratio: compare with the result of `trunk' (greater is better)
name    st-noprime
hash_flatten    1.063
hash_keys       1.020
hash_shift      0.976
hash_values     0.982
loop_whileloop2 1.000
vm2_bighash*    1.174

The following changes since commit 2c3522c3e403dfdadaaf6095564bde364cc4bddf:

test_thread.rb: stop at once (2014-01-16 06:34:47 +0000)

are available in the git repository at:

git://80x24.org/ruby.git st-noprime

for you to fetch changes up to ed4f4103f4f407ed99dd6cd25b6c35d3aa9f3479:

st: use power-of-two sizes to avoid slow modulo ops (2014-01-18 04:05:21 +0000)


Eric Wong (1):
st: use power-of-two sizes to avoid slow modulo ops

st.c | 100 +++++++++++++++++--------------------------------------------------
1 file changed, 25 insertions(+), 75 deletions(-)


Files

Updated by normalperson (Eric Wong) almost 11 years ago

Updated patch and pull due to r44634 (hash_pos)
http://bogomips.org/ruby.git/patch?id=7c7e41496d6b28a4b

The following changes since commit 5ecbe189af77c309845d662f26c0b2797bde2915:

socket/option.c: helper functions (2014-01-19 14:56:05 +0000)

are available in the git repository at:

git://80x24.org/ruby.git st-noprime-r44658

for you to fetch changes up to 7c7e41496d6b28a4b1a516c0df9754c03447a95d:

st: use power-of-two sizes to avoid slow modulo ops (2014-01-19 23:45:52 +0000)


Eric Wong (1):
st: use power-of-two sizes to avoid slow modulo ops

st.c | 61 +++++--------------------------------------------------------
1 file changed, 5 insertions(+), 56 deletions(-)

Updated by shyouhei (Shyouhei Urabe) almost 11 years ago

Impressive.

Plus I think you have more rooms for optimizations by sticking to
power-of-two sized bins. When you rehash a hash, right now all
elements are cleaned up from the bin, resize it, then insert them
again one-by-one. If number of bins are always 2^n, rehashing is
to resize a 2^n array to 2^(n+1). This case the elements stored in
new_bins[2^n+k] can only come from new_bins[k]. This fact does not
change the algorithmic complexity, but can reduce insertions.

https://github.com/shyouhei/ruby/commit/f7ca891

Updated by normalperson (Eric Wong) almost 11 years ago

wrote:

Plus I think you have more rooms for optimizations by sticking to
power-of-two sized bins. When you rehash a hash, right now all
elements are cleaned up from the bin, resize it, then insert them
again one-by-one. If number of bins are always 2^n, rehashing is
to resize a 2^n array to 2^(n+1). This case the elements stored in
new_bins[2^n+k] can only come from new_bins[k]. This fact does not
change the algorithmic complexity, but can reduce insertions.

https://github.com/shyouhei/ruby/commit/f7ca891

Thanks! However, I wasn't able to show a difference with
"make benchmark"[1]. Were you?

Perhaps rehash is not called often enough, and I think a branch
inside the loop is difficult for the CPU to optimize. I think
the current dumb loop is very good for CPU pipelining and prefetch.

[1] I have applied my patches for improved benchmark consistency:
https://bugs.ruby-lang.org/issues/5985#change-44442
https://bugs.ruby-lang.org/issues/9430

Updated by shyouhei (Shyouhei Urabe) almost 11 years ago

On 01/21/2014 11:38 AM, Eric Wong wrote:

https://github.com/shyouhei/ruby/commit/f7ca891

Thanks! However, I wasn't able to show a difference with
"make benchmark"[1]. Were you?

Perhaps rehash is not called often enough, and I think a branch
inside the loop is difficult for the CPU to optimize. I think
the current dumb loop is very good for CPU pipelining and prefetch.

No, sorry I see no evident speedup. When I wrote the patch I thought the
function was used for Hash#rehash, but it turned out Hash#rehash uses
something different (don't know why). The optimization is valid I
believe but in fact used very rarely.

Updated by normalperson (Eric Wong) almost 11 years ago

Urabe Shyouhei wrote:

No, sorry I see no evident speedup. When I wrote the patch I thought the
function was used for Hash#rehash, but it turned out Hash#rehash uses
something different (don't know why). The optimization is valid I
believe but in fact used very rarely.

Alright. My understanding is branch mispredict costs are higher than
the memory stores which would be avoided. The expensive part is loading
memory on cache miss, and that is not avoided.

We'll probably need to poke around with perf or similar tools to
analyze/confirm this.

Updated by normalperson (Eric Wong) almost 11 years ago

Notes to self (or anybody else with free time right now):
test and verify compare_by_identity performance

I'm comfortable that ID, string and most objects will hash well with
power-of-two; but compare_by_identity, weakmap, vm->living_threads may
hash less well without a prime modulo (or maybe they hash badly
regardless of modulo!)

I'm also interested in using a doubly-linked list (ccan/list[1])
for vm->living_threads (and possibly other places).

[1] BSDL version of the Linux kernel list.h at http://ccodearchive.net/

Updated by normalperson (Eric Wong) almost 11 years ago

wrote:

test and verify compare_by_identity performance

I'm comfortable that ID, string and most objects will hash well with
power-of-two; but compare_by_identity, weakmap, vm->living_threads may
hash less well without a prime modulo (or maybe they hash badly
regardless of modulo!)

OK, I was right about compare_by_identity being worse with power-of-two,
but I fixed it by tweaking numhash:

http://bogomips.org/ruby.git/patch?id=1579e9d0d82789

I was wrong about IDs hashing well before, they hash OK now :)

results: http://80x24.org/bmlog-20140303-034047.26775.txt
the hash parts:
hash_aref_miss 1.048
hash_aref_str 1.162
hash_aref_sym 1.000
hash_flatten 1.092
hash_ident_num 1.007
hash_ident_obj 1.098
hash_ident_str 1.106
hash_ident_sym 1.018
hash_keys 1.000
hash_shift 1.011
hash_values 1.011
vm2_bighash* 1.183

These numbers are from my weaker AMD FX-8320 which gave me worse numbers
than my Haswell machine in my original test. I'll try to test on my
Haswell machine soon (network outage there :<).

I'd like to commit the following three patches soon:
http://bogomips.org/ruby.git/patch?id=a3fde671ffeec8 new hash benchmarks
http://bogomips.org/ruby.git/patch?id=8f155afef61342 original patch
http://bogomips.org/ruby.git/patch?id=1579e9d0d82789 numhash tweak

Updated by normalperson (Eric Wong) almost 11 years ago

wrote:

 hash_aref_sym	1.000

Lack of improvement here was disappointing since symbol keys are common,
and this showed a regression on my x86 (32-bit) VMs.

I tweaked rb_any_hash to be symbol-aware:

http://bogomips.org/ruby.git/patch?id=497ed6355

12-30% improvement on this test from trunk depending on CPU so far \o/
(Phenom II X4 got 30%, newer/faster x86-64 CPUs show less speedup).

I'm comfortable with improvements of this series on x86 VMs running on
x86-64 (and of course native x86-64).

Can anybody with real 32-bit hardware verify this series? Not sure I
can trust VM results; my remaining x86 hardware is on its last legs
and showing occasional HW errors.

git://80x24.org/ruby.git st-noprime-v4

   st: use power-of-two sizes to avoid slow modulo ops
   add hash benchmarks
   st.c: tweak numhash to match common Ruby use cases
   hash.c: improve symbol hash distribution

Updated by funny_falcon (Yura Sokolov) almost 11 years ago

Can you test this

  • if (SYMBOL_P(a)) a = (a >> RUBY_SPECIAL_SHIFT) ^ (a >>
    (RUBY_SPECIAL_SHIFT + 3)); /* 3=ID_SCOPE_SHIFT */

2014-03-18 13:02 GMT+04:00 Eric Wong :

wrote:

  hash_aref_sym   1.000

Lack of improvement here was disappointing since symbol keys are common,
and this showed a regression on my x86 (32-bit) VMs.

I tweaked rb_any_hash to be symbol-aware:

    http://bogomips.org/ruby.git/patch?id=497ed6355

12-30% improvement on this test from trunk depending on CPU so far \o/
(Phenom II X4 got 30%, newer/faster x86-64 CPUs show less speedup).

I'm comfortable with improvements of this series on x86 VMs running on
x86-64 (and of course native x86-64).

Can anybody with real 32-bit hardware verify this series? Not sure I
can trust VM results; my remaining x86 hardware is on its last legs
and showing occasional HW errors.

git://80x24.org/ruby.git st-noprime-v4

  st: use power-of-two sizes to avoid slow modulo ops
  add hash benchmarks
  st.c: tweak numhash to match common Ruby use cases
  hash.c: improve symbol hash distribution

Updated by normalperson (Eric Wong) almost 11 years ago

wrote:

Can you test this

  • if (SYMBOL_P(a)) a = (a >> RUBY_SPECIAL_SHIFT) ^ (a >>
    (RUBY_SPECIAL_SHIFT + 3)); /* 3=ID_SCOPE_SHIFT */

Seems to hurt performance on x86 compared to my original. I don't think
ID scope bits are useful for this case. From what I've seen, Ruby users
tend to limit symbol names to \w when using the Hash class and keywords.

Actions #11

Updated by Anonymous almost 11 years ago

  • Status changed from Open to Closed
  • % Done changed from 0 to 100

Applied in changeset r45384.


st.c: use power-of-two sizes to avoid slow modulo ops

  • st.c (hash_pos): use bitwise AND to avoid slow modulo op
    (new_size): power-of-two sizes for hash_pos change
    (st_numhash): adjust for common keys due to lack of prime modulo
    [Feature #9425]
  • hash.c (rb_any_hash): right shift for symbols
  • benchmark/bm_hash_aref_miss.rb: added to show improvement
  • benchmark/bm_hash_aref_sym_long.rb: ditto
  • benchmark/bm_hash_aref_str.rb: ditto
  • benchmark/bm_hash_aref_sym.rb: ditto
  • benchmark/bm_hash_ident_num.rb: added to prevent regression
  • benchmark/bm_hash_ident_obj.rb: ditto
  • benchmark/bm_hash_ident_str.rb: ditto
  • benchmark/bm_hash_ident_sym.rb: ditto

Updated by normalperson (Eric Wong) almost 11 years ago

wrote:

OK, I was right about compare_by_identity being worse with power-of-two,
but I fixed it by tweaking numhash:

http://bogomips.org/ruby.git/patch?id=1579e9d0d82789

I tweaked that further, adding +3 instead of +1 to RUBY_SPECIAL_SHIFT
in r45384. Also updated NEWS in case some extensions need tweaking for
performance.

Haswell Xeon E3-1230 v3 numbers:

hash_aref_miss 1.166
hash_aref_str 1.167
hash_aref_sym 1.224
hash_aref_sym_long 1.270
hash_flatten 1.656
hash_ident_num 1.142
hash_ident_obj 1.193
hash_ident_str 1.194
hash_ident_sym 1.171
hash_keys 1.002
hash_shift 1.122
hash_values 1.006
loop_whileloop2 1.001
vm2_bighash* 1.233

Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0