[druntime] TypeInfo.getHash problems
Dmitry Olshansky
dmitry.olsh at gmail.com
Thu Mar 1 00:36:31 PST 2012
On 01.03.2012 1:45, H. S. Teoh wrote:
> I found the following bugs with the current implementations of getHash
> in the various TypeInfo classes:
>
> 1) TypeInfo_Array.getHash doesn't work correctly with class objects
> with custom toHash() methods, or, for that matter, with arrays of AA's
> (which have their own problems--see (3a)). It just does a hash of the
> binary representation of the array.
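For illustration, the kind of key that (1) breaks: a hypothetical Key class
with its own toHash (untested sketch, names made up):

class Key
{
    int id;
    this(int id) { this.id = id; }
    override size_t toHash() @trusted nothrow { return id; }
    override bool opEquals(Object o)
    {
        auto k = cast(Key) o;
        return k !is null && k.id == id;
    }
}

void main()
{
    int[Key[]] aa;
    aa[[new Key(1), new Key(2)]] = 42;
    // A logically equal key built from fresh instances: if the array hash
    // is computed over the references' bit patterns, Key.toHash is ignored
    // and the lookup is likely to miss.
    auto p = [new Key(1), new Key(2)] in aa;
}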
>
> 2) TypeInfo_StaticArray.getHash doesn't suffer from (1), but it does
> suffer from another problem: it simply sums the hashes from each
> element, but this is prone to reordering hash collisions: an int[3] with
> elements [1,2,3] will have the same hash value as [3,2,1] or any
> reordering thereof.
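That one is easy to see with a toy version of the summing scheme; the
collision does not depend on what the per-element hash does, since addition
commutes:

size_t sumHash(const(int)[] a)
{
    size_t h = 0;
    foreach (x; a)
        h += typeid(int).getHash(&x); // per-element hash, then plain summation
    return h;
}

unittest
{
    int[3] a = [1, 2, 3];
    int[3] b = [3, 2, 1];
    assert(sumHash(a[]) == sumHash(b[])); // any permutation collides
}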
>
> 3) TypeInfo_AssociativeArray.getHash is not implemented, so it just
> defaults to the base class getHash, which just does a hash of the binary
> representation of the AA. Unfortunately, this is fraught with problems:
>
> a) Two AA's with exactly the same key/value pairs may be binary
> unequal, because the underlying size of the hash table may be
> different (this happens when you initialize an AA with a literal
> vs. manually setting each element).
>
> b) If values are objects that implement toHash(), then the
> current implementation will have the wrong behaviour.
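(3a) is easy to reproduce: build the same mapping two different ways and the
AA's compare equal even though their internals need not match byte for byte:

void main()
{
    int[string] literal = ["a": 1, "b": 2]; // built from a literal
    int[string] manual;
    manual["a"] = 1;                        // built element by element
    manual["b"] = 2;
    assert(literal == manual);              // logically equal...
    // ...but a hash over the raw representation sees two tables that may
    // differ in size and layout, so it can come out different.
}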
>
> 4) TypeInfo_Class.getHash will call the object's toHash method, if
> implemented; otherwise it hashes the binary representation of the
> object. But if the object has members that are references/pointers, or
> other objects/structs with their own toHash, this will be wrong.
>
> These problems cause AA's to have many subtle breakages. For example,
> using another AA as key doesn't work properly because of (3); using
> arrays of objects as key doesn't work because of (1); and using static
> arrays as key may trigger an unusually high number of hash collisions
> due to (2).
>
> To fix (1) may cause noticeable slowdown when the key is an array of
> basic types, such as dstring (we'd have to compute the hash of each
> dchar, then compute the hash of the hashes). Is there an easy way to
> tell whether or not an array's elements are binary-hashable? If so, we
> can still use hashing of the binary representation of the array for the
> simple cases (dstring, wstring, array of structs without nested
> refs/objects), and save the slower code for more complicated arrays.
>
std.traits hasIndirections?
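Something along those lines, i.e. keep the cheap byte-wise path only for
element types without references. Rough sketch (bytesHash is a stand-in for
whatever byte-wise hash druntime actually uses):

import std.traits : hasIndirections;

// trivial FNV-1a over raw bytes, standing in for druntime's byte-wise hash
size_t bytesHash(const(void)[] data) @trusted nothrow
{
    size_t h = 2166136261u;
    foreach (b; cast(const(ubyte)[]) data)
        h = (h ^ b) * 16777619u;
    return h;
}

size_t arrayHash(T)(const(T)[] arr)
{
    static if (!hasIndirections!T)
    {
        // dstring, wstring, structs without refs: raw bytes are enough
        return bytesHash(arr);
    }
    else
    {
        // classes, arrays of arrays, AA's, ...: combine per-element hashes
        size_t h = 0;
        foreach (i, ref e; arr)
            h = h * 31 + typeid(T).getHash(&e) + i; // the index keeps order relevant
        return h;
    }
}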
> For (2), we may be able to incorporate the array index into the hash
> summation so as to make it resistant to reordering collisions. But it
> may be that we might as well just compute the hash of the hashes at this
> point?
I suspect arrays should usually be hashed like strings, e.g. something
simplistic:

hash = seed;
foreach (x; arr)
    hash = mix(hash, x); // if x is not integral, use x.toHash()
hash = mix(hash, arr.length);

where mix(h, x) is something like k*h + x, or (h << k) ^ x, or whatever.
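In compilable form, with one arbitrary choice of mix (the constants here are
placeholders, not a proposal):

size_t mix(size_t h, size_t x) nothrow @safe
{
    return (h * 31 + x) ^ (x << 7);
}

size_t orderedArrayHash(const(int)[] arr) nothrow @safe
{
    enum size_t seed = 5381;
    size_t h = seed;
    foreach (x; arr)
        h = mix(h, cast(size_t) x); // non-integral elements would go through toHash
    return mix(h, arr.length);
}

unittest
{
    // unlike plain summation, reordering changes the result here
    assert(orderedArrayHash([1, 2, 3]) != orderedArrayHash([3, 2, 1]));
}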
>
> For (3), we can collect the hashes of the keys (already stored in the
> bucket) and the hashes of the values, and do a hash over those. (This
> seems to be the easiest one to fix?)
>
Or hash them as an array of pairs, whichever is faster.
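Either way, the combine across pairs has to be order-independent, since two
equal AA's need not iterate in the same order (see (3a)). Rough sketch:

size_t aaHash(K, V)(V[K] aa)
{
    size_t h = 0;
    foreach (key, val; aa)
    {
        // per-pair hash via the element TypeInfos, so custom toHash is honoured
        size_t kh = typeid(K).getHash(&key);
        size_t vh = typeid(V).getHash(&val);
        h += kh ^ (vh * 2654435769u); // commutative combine across pairs
    }
    return h;
}

Going through typeid(V).getHash for the values also covers (3b).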
> To fix (4) will require compile-time reflection... but may introduce
> slowdowns? Is there a way to detect the easy cases (binary-hashable) so
> that we only incur slowdown on the complicated cases?
>
see point 1.
--
Dmitry Olshansky