rb_str_hash_m discards bits from the hash
I believe rb_str_hash_m might discard some bits from the hash value in some situations.
It computes the hash as a st_index_t, which is either a unsigned long or a unsigned long long.
But the st_index_t value is converted to a VALUE with:
#define ST2FIX(h) LONG2FIX((long)(h))
Note that for instance on x64-mingw32, SIZEOF_LONG is 4, but SIZEOF_LONG_LONG and SIZEOF_VOIDP are 8 bytes.
So that truncates half the bits of the hash on such a platform if my understanding is correct.
Even is SIZEOF_LONG is 8, LONG2FIX loses the MSB I think, given that not all long can fit the Fixnum range on MRI (should it be LONG2NUM?).
Also, I am not sure if it is intended to cast from a unsigned value to a signed value.
I tried many things while debugging the rb_str_hash spec on ruby/spec and eventually gave up.
This computation looks wrong to me in MRI.
For info, here is my debug code:
and the build result on AppVeyor:
Updated by duerst (Martin Dürst) over 2 years ago
I don't think there is any guarantee for the length of a hash value in Ruby. It's just assumed it's long enough to not lead to overly many collisions.
Also, if the calculation of the original value (before throwing away bits) is really good (i.e. all bits of the input affect all bits of the output,...), then when there is a need to shorten a hash value, which bits are being thrown away shouldn't make any difference. (sorry, quite a few ifs)
Updated by shyouhei (Shyouhei Urabe) over 2 years ago
We looked at this issue at yesterday's developer meeting.
The attendees agree that current behaviour is intentional. Because creating Bignums every time is too slow for this experiential hot spot, we want to avoid such thing and stick to Fixnum.
If you (or other cryptography experts) have any concerns on information security by cutting bits of hash values, please tell us a bit more details about what's wrong.
Updated by Eregon (Benoit Daloze) over 2 years ago
I think the case where half the bits are lost could become a potential security issue.
Essentially all strings which have the same first half will collide in a Hash, and that's likely trivial to generate
(the same prefix/suffix of the right length is likely to generate the same half).
In that case (sizeof(long) < sizeof(void*)), I think at least the two parts should be combined with something like (long)(value ^ (value >> 32)).
But I am not a security expert.