[druntime] TypeInfo.getHash problems
Dmitry Olshansky
dmitry.olsh at gmail.com
Thu Mar 1 00:36:31 PST 2012
On 01.03.2012 1:45, H. S. Teoh wrote:
> I found the following bugs with the current implementations of getHash
> in the various TypeInfo classes:
>
> 1) TypeInfo_Array.getHash doesn't work correctly with class objects
> with custom toHash() methods, or, for that matter, with arrays of AA's
> (which have their own problems--see (3a)). It just does a hash of the
> binary representation of the array.
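For illustration, the kind of key that (1) breaks: a hypothetical Key class
with its own toHash (untested sketch, names made up):

class Key
{
    int id;
    this(int id) { this.id = id; }
    override size_t toHash() @trusted nothrow { return id; }
    override bool opEquals(Object o)
    {
        auto k = cast(Key) o;
        return k !is null && k.id == id;
    }
}

void main()
{
    int[Key[]] aa;
    aa[[new Key(1), new Key(2)]] = 42;
    // A logically equal key built from fresh instances: if the array hash
    // is computed over the references' bit patterns, Key.toHash is ignored
    // and the lookup is likely to miss.
    auto p = [new Key(1), new Key(2)] in aa;
}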
>
> 2) TypeInfo_StaticArray.getHash doesn't suffer from (1), but it does
> suffer from another problem: it simply sums the hashes from each
> element, but this is prone to reordering hash collisions: an int[3] with
> elements [1,2,3] will have the same hash value as [3,2,1] or any
> reordering thereof.
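That one is easy to see with a toy version of the summing scheme; the
collision does not depend on what the per-element hash does, since addition
commutes:

size_t sumHash(const(int)[] a)
{
    size_t h = 0;
    foreach (x; a)
        h += typeid(int).getHash(&x); // per-element hash, then plain summation
    return h;
}

unittest
{
    int[3] a = [1, 2, 3];
    int[3] b = [3, 2, 1];
    assert(sumHash(a[]) == sumHash(b[])); // any permutation collides
}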
>
> 3) TypeInfo_AssociativeArray.getHash is not implemented, so it just
> defaults to the base class getHash, which just does a hash of the binary
> representation of the AA. Unfortunately, this is fraught with problems:
>
> a) Two AA's with exactly the same key/value pairs may be binary
> unequal, because the underlying size of the hash table may be
> different (this happens when you initialize an AA with a literal
> vs. manually setting each element).
>
> b) If values are objects that implement toHash(), then the
> current implementation will have the wrong behaviour.
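(3a) is easy to reproduce: build the same mapping two different ways and the
AA's compare equal even though their internals need not match byte for byte:

void main()
{
    int[string] literal = ["a": 1, "b": 2]; // built from a literal
    int[string] manual;
    manual["a"] = 1;                        // built element by element
    manual["b"] = 2;
    assert(literal == manual);              // logically equal...
    // ...but a hash over the raw representation sees two tables that may
    // differ in size and layout, so it can come out different.
}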
>
> 4) TypeInfo_Class.getHash will call the object's toHash method, if
> implemented; otherwise it hashes the binary representation of the
> object. But if the object has members that are references/pointers, or
> other objects/structs with their own toHash, this will be wrong.
>
> These problems cause AA's to have many subtle breakages. For example,
> using another AA as key doesn't work properly because of (3); using
> arrays of objects as key doesn't work because of (1); and using static
> arrays as key may trigger an unusually high number of hash collisions
> due to (2).
>
> To fix (1) may cause noticeable slowdown when the key is an array of
> basic types, such as dstring (we'd have to compute the hash of each
> dchar, then compute the hash of the hashes). Is there an easy way to
> tell whether or not an array's elements are binary-hashable? If so, we
> can still use hashing of the binary representation of the array for the
> simple cases (dstring, wstring, array of structs without nested
> refs/objects), and save the slower code for more complicated arrays.
>
std.traits hasIndirections?
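Something along those lines, i.e. keep the cheap byte-wise path only for
element types without references. Rough sketch (bytesHash is a stand-in for
whatever byte-wise hash druntime actually uses):

import std.traits : hasIndirections;

// trivial FNV-1a over raw bytes, standing in for druntime's byte-wise hash
size_t bytesHash(const(void)[] data) @trusted nothrow
{
    size_t h = 2166136261u;
    foreach (b; cast(const(ubyte)[]) data)
        h = (h ^ b) * 16777619u;
    return h;
}

size_t arrayHash(T)(const(T)[] arr)
{
    static if (!hasIndirections!T)
    {
        // dstring, wstring, structs without refs: raw bytes are enough
        return bytesHash(arr);
    }
    else
    {
        // classes, arrays of arrays, AA's, ...: combine per-element hashes
        size_t h = 0;
        foreach (i, ref e; arr)
            h = h * 31 + typeid(T).getHash(&e) + i; // the index keeps order relevant
        return h;
    }
}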
> For (2), we may be able to incorporate the array index into the hash
> summation so as to make it resistant to reordering collisions. But it
> may be that we might as well just compute the hash of the hashes at this
> point?
I suspect arrays should usually be hashed like strings, e.g. something
simplistic:

hash = seed;
foreach (x; arr)
    hash = mix(hash, x); // if x is not integral, use x.toHash()
hash = mix(hash, arr.length);

where mix(h, x) is something like k*h + x, or (h << k) ^ x, or whatever.
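In compilable form, with one arbitrary choice of mix (the constants here are
placeholders, not a proposal):

size_t mix(size_t h, size_t x) nothrow @safe
{
    return (h * 31 + x) ^ (x << 7);
}

size_t orderedArrayHash(const(int)[] arr) nothrow @safe
{
    enum size_t seed = 5381;
    size_t h = seed;
    foreach (x; arr)
        h = mix(h, cast(size_t) x); // non-integral elements would go through toHash
    return mix(h, arr.length);
}

unittest
{
    // unlike plain summation, reordering changes the result here
    assert(orderedArrayHash([1, 2, 3]) != orderedArrayHash([3, 2, 1]));
}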
>
> For (3), we can collect the hashes of the keys (already stored in the
> bucket) and the hashes of the values, and do a hash over those. (This
> seems to be the easiest one to fix?)
>
Or hash them as an array of pairs, whichever is faster.
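Either way, the combine across pairs has to be order-independent, since two
equal AA's need not iterate in the same order (see (3a)). Rough sketch:

size_t aaHash(K, V)(V[K] aa)
{
    size_t h = 0;
    foreach (key, val; aa)
    {
        // per-pair hash via the element TypeInfos, so custom toHash is honoured
        size_t kh = typeid(K).getHash(&key);
        size_t vh = typeid(V).getHash(&val);
        h += kh ^ (vh * 2654435769u); // commutative combine across pairs
    }
    return h;
}

Going through typeid(V).getHash for the values also covers (3b).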
> To fix (4) will require compile-time reflection... but may introduce
> slowdowns? Is there a way to detect the easy cases (binary-hashable) so
> that we only incur slowdown on the complicated cases?
>
see point 1.
--
Dmitry Olshansky