https://bugs.ruby-lang.org/https://bugs.ruby-lang.org/favicon.ico?17113305112014-01-20T04:02:53ZRuby Issue Tracking SystemRuby master - Feature #9425: [PATCH] st: use power-of-two sizes to avoid slow modulo opshttps://bugs.ruby-lang.org/issues/9425?journal_id=444412014-01-20T04:02:53Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p>Updated patch and pull due to r44634 (hash_pos)<br>
<a href="http://bogomips.org/ruby.git/patch?id=7c7e41496d6b28a4b" class="external">http://bogomips.org/ruby.git/patch?id=7c7e41496d6b28a4b</a></p>
<p>The following changes since commit 5ecbe189af77c309845d662f26c0b2797bde2915:</p>
<p>socket/option.c: helper functions (2014-01-19 14:56:05 +0000)</p>
<p>are available in the git repository at:</p>
<p>git://80x24.org/ruby.git st-noprime-r44658</p>
<p>for you to fetch changes up to 7c7e41496d6b28a4b1a516c0df9754c03447a95d:</p>
<p>st: use power-of-two sizes to avoid slow modulo ops (2014-01-19 23:45:52 +0000)</p>
<hr>
<p>Eric Wong (1):<br>
st: use power-of-two sizes to avoid slow modulo ops</p>
<p>st.c | 61 +++++--------------------------------------------------------<br>
1 file changed, 5 insertions(+), 56 deletions(-)</p> Ruby master - Feature #9425: [PATCH] st: use power-of-two sizes to avoid slow modulo opshttps://bugs.ruby-lang.org/issues/9425?journal_id=444542014-01-20T17:48:44Zshyouhei (Shyouhei Urabe)shyouhei@ruby-lang.org
<ul></ul><p>Impressive.</p>
<p>Plus I think you have more rooms for optimizations by sticking to<br>
power-of-two sized bins. When you rehash a hash, right now all<br>
elements are cleaned up from the bin, resize it, then insert them<br>
again one-by-one. If number of bins are always 2^n, rehashing is<br>
to resize a 2^n array to 2^(n+1). This case the elements stored in<br>
new_bins[2^n+k] can only come from new_bins[k]. This fact does not<br>
change the algorithmic complexity, but can reduce insertions.</p>
<p><a href="https://github.com/shyouhei/ruby/commit/f7ca891" class="external">https://github.com/shyouhei/ruby/commit/f7ca891</a></p> Ruby master - Feature #9425: [PATCH] st: use power-of-two sizes to avoid slow modulo opshttps://bugs.ruby-lang.org/issues/9425?journal_id=444592014-01-21T02:41:41Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p><a href="mailto:shyouhei@ruby-lang.org" class="email">shyouhei@ruby-lang.org</a> wrote:</p>
<blockquote>
<p>Plus I think you have more rooms for optimizations by sticking to<br>
power-of-two sized bins. When you rehash a hash, right now all<br>
elements are cleaned up from the bin, resize it, then insert them<br>
again one-by-one. If number of bins are always 2^n, rehashing is<br>
to resize a 2^n array to 2^(n+1). This case the elements stored in<br>
new_bins[2^n+k] can only come from new_bins[k]. This fact does not<br>
change the algorithmic complexity, but can reduce insertions.</p>
<p><a href="https://github.com/shyouhei/ruby/commit/f7ca891" class="external">https://github.com/shyouhei/ruby/commit/f7ca891</a></p>
</blockquote>
<p>Thanks! However, I wasn't able to show a difference with<br>
"make benchmark"[1]. Were you?</p>
<p>Perhaps rehash is not called often enough, and I think a branch<br>
inside the loop is difficult for the CPU to optimize. I think<br>
the current dumb loop is very good for CPU pipelining and prefetch.</p>
<p>[1] I have applied my patches for improved benchmark consistency:<br>
<a href="https://bugs.ruby-lang.org/issues/5985#change-44442" class="external">https://bugs.ruby-lang.org/issues/5985#change-44442</a><br>
<a href="https://bugs.ruby-lang.org/issues/9430" class="external">https://bugs.ruby-lang.org/issues/9430</a></p> Ruby master - Feature #9425: [PATCH] st: use power-of-two sizes to avoid slow modulo opshttps://bugs.ruby-lang.org/issues/9425?journal_id=444882014-01-22T03:21:59Zshyouhei (Shyouhei Urabe)shyouhei@ruby-lang.org
<ul></ul><p>On 01/21/2014 11:38 AM, Eric Wong wrote:</p>
<blockquote>
<blockquote>
<p><a href="https://github.com/shyouhei/ruby/commit/f7ca891" class="external">https://github.com/shyouhei/ruby/commit/f7ca891</a></p>
</blockquote>
<p>Thanks! However, I wasn't able to show a difference with<br>
"make benchmark"[1]. Were you?</p>
<p>Perhaps rehash is not called often enough, and I think a branch<br>
inside the loop is difficult for the CPU to optimize. I think<br>
the current dumb loop is very good for CPU pipelining and prefetch.</p>
</blockquote>
<p>No, sorry I see no evident speedup. When I wrote the patch I thought the<br>
function was used for Hash#rehash, but it turned out Hash#rehash uses<br>
something different (don't know why). The optimization is valid I<br>
believe but in fact used very rarely.</p> Ruby master - Feature #9425: [PATCH] st: use power-of-two sizes to avoid slow modulo opshttps://bugs.ruby-lang.org/issues/9425?journal_id=444992014-01-22T08:12:52Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p>Urabe Shyouhei <a href="mailto:shyouhei@ruby-lang.org" class="email">shyouhei@ruby-lang.org</a> wrote:</p>
<blockquote>
<p>No, sorry I see no evident speedup. When I wrote the patch I thought the<br>
function was used for Hash#rehash, but it turned out Hash#rehash uses<br>
something different (don't know why). The optimization is valid I<br>
believe but in fact used very rarely.</p>
</blockquote>
<p>Alright. My understanding is branch mispredict costs are higher than<br>
the memory stores which would be avoided. The expensive part is loading<br>
memory on cache miss, and that is not avoided.</p>
<p>We'll probably need to poke around with perf or similar tools to<br>
analyze/confirm this.</p> Ruby master - Feature #9425: [PATCH] st: use power-of-two sizes to avoid slow modulo opshttps://bugs.ruby-lang.org/issues/9425?journal_id=448672014-02-01T08:02:50Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p>Notes to self (or anybody else with free time right now):<br>
test and verify compare_by_identity performance</p>
<p>I'm comfortable that ID, string and most objects will hash well with<br>
power-of-two; but compare_by_identity, weakmap, vm->living_threads may<br>
hash less well without a prime modulo (or maybe they hash badly<br>
regardless of modulo!)</p>
<p>I'm also interested in using a doubly-linked list (ccan/list[1])<br>
for vm->living_threads (and possibly other places).</p>
<p>[1] BSDL version of the Linux kernel list.h at <a href="http://ccodearchive.net/" class="external">http://ccodearchive.net/</a></p> Ruby master - Feature #9425: [PATCH] st: use power-of-two sizes to avoid slow modulo opshttps://bugs.ruby-lang.org/issues/9425?journal_id=455862014-03-03T06:33:20Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p><a href="mailto:normalperson@yhbt.net" class="email">normalperson@yhbt.net</a> wrote:</p>
<blockquote>
<p>test and verify compare_by_identity performance</p>
<p>I'm comfortable that ID, string and most objects will hash well with<br>
power-of-two; but compare_by_identity, weakmap, vm->living_threads may<br>
hash less well without a prime modulo (or maybe they hash badly<br>
regardless of modulo!)</p>
</blockquote>
<p>OK, I was right about compare_by_identity being worse with power-of-two,<br>
but I fixed it by tweaking numhash:</p>
<p><a href="http://bogomips.org/ruby.git/patch?id=1579e9d0d82789" class="external">http://bogomips.org/ruby.git/patch?id=1579e9d0d82789</a></p>
<p>I was wrong about IDs hashing well before, they hash OK now :)</p>
<p>results: <a href="http://80x24.org/bmlog-20140303-034047.26775.txt" class="external">http://80x24.org/bmlog-20140303-034047.26775.txt</a><br>
the hash parts:<br>
hash_aref_miss 1.048<br>
hash_aref_str 1.162<br>
hash_aref_sym 1.000<br>
hash_flatten 1.092<br>
hash_ident_num 1.007<br>
hash_ident_obj 1.098<br>
hash_ident_str 1.106<br>
hash_ident_sym 1.018<br>
hash_keys 1.000<br>
hash_shift 1.011<br>
hash_values 1.011<br>
vm2_bighash* 1.183</p>
<p>These numbers are from my weaker AMD FX-8320 which gave me worse numbers<br>
than my Haswell machine in my original test. I'll try to test on my<br>
Haswell machine soon (network outage there :<).</p>
<p>I'd like to commit the following three patches soon:<br>
<a href="http://bogomips.org/ruby.git/patch?id=a3fde671ffeec8" class="external">http://bogomips.org/ruby.git/patch?id=a3fde671ffeec8</a> new hash benchmarks<br>
<a href="http://bogomips.org/ruby.git/patch?id=8f155afef61342" class="external">http://bogomips.org/ruby.git/patch?id=8f155afef61342</a> original patch<br>
<a href="http://bogomips.org/ruby.git/patch?id=1579e9d0d82789" class="external">http://bogomips.org/ruby.git/patch?id=1579e9d0d82789</a> numhash tweak</p> Ruby master - Feature #9425: [PATCH] st: use power-of-two sizes to avoid slow modulo opshttps://bugs.ruby-lang.org/issues/9425?journal_id=458592014-03-18T09:12:01Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p><a href="mailto:normalperson@yhbt.net" class="email">normalperson@yhbt.net</a> wrote:</p>
<blockquote>
<pre><code> hash_aref_sym 1.000
</code></pre>
</blockquote>
<p>Lack of improvement here was disappointing since symbol keys are common,<br>
and this showed a regression on my x86 (32-bit) VMs.</p>
<p>I tweaked rb_any_hash to be symbol-aware:</p>
<pre><code>http://bogomips.org/ruby.git/patch?id=497ed6355
</code></pre>
<p>12-30% improvement on this test from trunk depending on CPU so far \o/<br>
(Phenom II X4 got 30%, newer/faster x86-64 CPUs show less speedup).</p>
<p>I'm comfortable with improvements of this series on x86 VMs running on<br>
x86-64 (and of course native x86-64).</p>
<p>Can anybody with real 32-bit hardware verify this series? Not sure I<br>
can trust VM results; my remaining x86 hardware is on its last legs<br>
and showing occasional HW errors.</p>
<p>git://80x24.org/ruby.git st-noprime-v4</p>
<pre><code> st: use power-of-two sizes to avoid slow modulo ops
add hash benchmarks
st.c: tweak numhash to match common Ruby use cases
hash.c: improve symbol hash distribution
</code></pre> Ruby master - Feature #9425: [PATCH] st: use power-of-two sizes to avoid slow modulo opshttps://bugs.ruby-lang.org/issues/9425?journal_id=458602014-03-18T09:48:56Zfunny_falcon (Yura Sokolov)funny.falcon@gmail.com
<ul></ul><p>Can you test this</p>
<ul>
<li>if (SYMBOL_P(a)) a = (a >> RUBY_SPECIAL_SHIFT) ^ (a >><br>
(RUBY_SPECIAL_SHIFT + 3)); /* 3=ID_SCOPE_SHIFT */</li>
</ul>
<p>2014-03-18 13:02 GMT+04:00 Eric Wong <a href="mailto:normalperson@yhbt.net" class="email">normalperson@yhbt.net</a>:</p>
<blockquote>
<p><a href="mailto:normalperson@yhbt.net" class="email">normalperson@yhbt.net</a> wrote:</p>
<blockquote>
<pre><code> hash_aref_sym 1.000
</code></pre>
</blockquote>
<p>Lack of improvement here was disappointing since symbol keys are common,<br>
and this showed a regression on my x86 (32-bit) VMs.</p>
<p>I tweaked rb_any_hash to be symbol-aware:</p>
<pre><code> http://bogomips.org/ruby.git/patch?id=497ed6355
</code></pre>
<p>12-30% improvement on this test from trunk depending on CPU so far \o/<br>
(Phenom II X4 got 30%, newer/faster x86-64 CPUs show less speedup).</p>
<p>I'm comfortable with improvements of this series on x86 VMs running on<br>
x86-64 (and of course native x86-64).</p>
<p>Can anybody with real 32-bit hardware verify this series? Not sure I<br>
can trust VM results; my remaining x86 hardware is on its last legs<br>
and showing occasional HW errors.</p>
<p>git://80x24.org/ruby.git st-noprime-v4</p>
<pre><code> st: use power-of-two sizes to avoid slow modulo ops
add hash benchmarks
st.c: tweak numhash to match common Ruby use cases
hash.c: improve symbol hash distribution
</code></pre>
</blockquote> Ruby master - Feature #9425: [PATCH] st: use power-of-two sizes to avoid slow modulo opshttps://bugs.ruby-lang.org/issues/9425?journal_id=458622014-03-18T19:48:47Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p><a href="mailto:funny.falcon@gmail.com" class="email">funny.falcon@gmail.com</a> wrote:</p>
<blockquote>
<p>Can you test this</p>
<ul>
<li>if (SYMBOL_P(a)) a = (a >> RUBY_SPECIAL_SHIFT) ^ (a >><br>
(RUBY_SPECIAL_SHIFT + 3)); /* 3=ID_SCOPE_SHIFT */</li>
</ul>
</blockquote>
<p>Seems to hurt performance on x86 compared to my original. I don't think<br>
ID scope bits are useful for this case. From what I've seen, Ruby users<br>
tend to limit symbol names to \w when using the Hash class and keywords.</p> Ruby master - Feature #9425: [PATCH] st: use power-of-two sizes to avoid slow modulo opshttps://bugs.ruby-lang.org/issues/9425?journal_id=459032014-03-22T23:34:32ZAnonymous
<ul><li><strong>Status</strong> changed from <i>Open</i> to <i>Closed</i></li><li><strong>% Done</strong> changed from <i>0</i> to <i>100</i></li></ul><p>Applied in changeset r45384.</p>
<hr>
<p>st.c: use power-of-two sizes to avoid slow modulo ops</p>
<ul>
<li>st.c (hash_pos): use bitwise AND to avoid slow modulo op<br>
(new_size): power-of-two sizes for hash_pos change<br>
(st_numhash): adjust for common keys due to lack of prime modulo<br>
[Feature <a class="issue tracker-2 status-5 priority-4 priority-default closed" title="Feature: [PATCH] st: use power-of-two sizes to avoid slow modulo ops (Closed)" href="https://bugs.ruby-lang.org/issues/9425">#9425</a>]</li>
<li>hash.c (rb_any_hash): right shift for symbols</li>
<li>benchmark/bm_hash_aref_miss.rb: added to show improvement</li>
<li>benchmark/bm_hash_aref_sym_long.rb: ditto</li>
<li>benchmark/bm_hash_aref_str.rb: ditto</li>
<li>benchmark/bm_hash_aref_sym.rb: ditto</li>
<li>benchmark/bm_hash_ident_num.rb: added to prevent regression</li>
<li>benchmark/bm_hash_ident_obj.rb: ditto</li>
<li>benchmark/bm_hash_ident_str.rb: ditto</li>
<li>benchmark/bm_hash_ident_sym.rb: ditto</li>
</ul> Ruby master - Feature #9425: [PATCH] st: use power-of-two sizes to avoid slow modulo opshttps://bugs.ruby-lang.org/issues/9425?journal_id=459042014-03-22T23:48:53Znormalperson (Eric Wong)normalperson@yhbt.net
<ul></ul><p><a href="mailto:normalperson@yhbt.net" class="email">normalperson@yhbt.net</a> wrote:</p>
<blockquote>
<p>OK, I was right about compare_by_identity being worse with power-of-two,<br>
but I fixed it by tweaking numhash:</p>
<p><a href="http://bogomips.org/ruby.git/patch?id=1579e9d0d82789" class="external">http://bogomips.org/ruby.git/patch?id=1579e9d0d82789</a></p>
</blockquote>
<p>I tweaked that further, adding +3 instead of +1 to RUBY_SPECIAL_SHIFT<br>
in r45384. Also updated NEWS in case some extensions need tweaking for<br>
performance.</p>
<p>Haswell Xeon E3-1230 v3 numbers:</p>
<p>hash_aref_miss 1.166<br>
hash_aref_str 1.167<br>
hash_aref_sym 1.224<br>
hash_aref_sym_long 1.270<br>
hash_flatten 1.656<br>
hash_ident_num 1.142<br>
hash_ident_obj 1.193<br>
hash_ident_str 1.194<br>
hash_ident_sym 1.171<br>
hash_keys 1.002<br>
hash_shift 1.122<br>
hash_values 1.006<br>
loop_whileloop2 1.001<br>
vm2_bighash* 1.233</p>