[druntime] TypeInfo.getHash problems

H. S. Teoh hsteoh at quickfur.ath.cx
Wed Feb 29 13:45:47 PST 2012


I found the following bugs with the current implementations of getHash
in the various TypeInfo classes:

1) TypeInfo_Array.getHash doesn't work correctly with arrays of class
objects that have custom toHash() methods, or, for that matter, with
arrays of AA's (which have their own problems--see (3a)): it just hashes
the binary representation of the array.

2) TypeInfo_StaticArray.getHash doesn't suffer from (1), but it has
another problem: it simply sums the hashes of the elements, which is
prone to reordering collisions: an int[3] with elements [1,2,3] will
have the same hash value as [3,2,1] or any other reordering thereof.

3) TypeInfo_AssociativeArray.getHash is not implemented, so it falls
back to the base class getHash, which just hashes the binary
representation of the AA. Unfortunately, this is fraught with problems:

	a) Two AA's with exactly the same key/value pairs may be binary
	unequal, because the underlying size of the hash table may be
	different (this happens when you initialize an AA with a literal
	vs. manually setting each element).

	b) If values are objects that implement toHash(), then the
	current implementation will have the wrong behaviour.

4) TypeInfo_Class.getHash calls the object's toHash method if one is
implemented; otherwise it hashes the binary representation of the
object. But if the object has members that are references/pointers, or
other objects/structs with their own toHash(), the binary hash will be
wrong.

These problems cause many subtle breakages in AA's. For example, using
another AA as a key doesn't work properly because of (3); using arrays
of objects as keys doesn't work because of (1); and using static arrays
as keys may trigger an unusually high number of hash collisions because
of (2).

Fixing (1) may cause a noticeable slowdown when the key is an array of
basic types, such as dstring (we'd have to compute the hash of each
dchar, then compute the hash of the hashes).  Is there an easy way to
tell whether or not an array's elements are binary-hashable? If so, we
could still hash the binary representation of the array in the simple
cases (dstring, wstring, arrays of structs without nested
refs/objects), and save the slower code for the more complicated arrays.
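
Something like the following might do it (a rough sketch using
compile-time introspection; isBinaryHashable is a made-up name, not
anything druntime actually has):

import std.meta   : allSatisfy;
import std.traits : FieldTypeTuple, isAssociativeArray, isDynamicArray,
                    isPointer;

// Hypothetical trait: true iff hashing T's raw bytes is a valid hash
// of its value (no indirections, no custom toHash to honour).
template isBinaryHashable(T)
{
    static if (is(T == class) || isPointer!T || isDynamicArray!T ||
               isAssociativeArray!T)
        enum isBinaryHashable = false;  // raw bytes are just a pointer
    else static if (is(T == struct))
    {
        static if (__traits(hasMember, T, "toHash"))
            enum isBinaryHashable = false;  // custom hash must be used
        else  // OK only if every field is itself binary-hashable
            enum isBinaryHashable =
                allSatisfy!(.isBinaryHashable, FieldTypeTuple!T);
    }
    else static if (is(T == E[n], E, size_t n))
        enum isBinaryHashable = .isBinaryHashable!E;  // static array
    else
        enum isBinaryHashable = true;  // basic types, enums, etc.
}

static assert( isBinaryHashable!dchar);
static assert(!isBinaryHashable!(int[string]));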

For (2), we may be able to incorporate the array index into the hash
summation so as to make it resistant to reordering collisions. But at
that point we might as well just compute the hash of the hashes?
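
For example (a sketch; the multiplier 31 is an arbitrary choice, and
hashStaticArray is a made-up name):

// Chaining multiply-and-add over the element hashes makes the result
// order-sensitive; it's effectively "hashing the hashes".
size_t hashStaticArray(T, size_t N)(ref const T[N] arr)
{
    size_t h = 0;
    foreach (ref e; arr)
        h = h * 31 + typeid(T).getHash(&e);
    return h;
}

void main()
{
    int[3] a = [1, 2, 3];
    int[3] b = [3, 2, 1];
    // Plain summation collides: the sum of the element hashes is the
    // same in any order.  The chained version distinguishes them (at
    // least for these inputs):
    assert(hashStaticArray(a) != hashStaticArray(b));
}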

For (3), we can collect the hashes of the keys (already stored in the
buckets) and the hashes of the values, and hash those. (This seems to be
the easiest one to fix?)
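
A sketch of what that could look like (hashAA is a made-up name; a real
implementation could reuse the key hashes already stored in the buckets
instead of recomputing them). The important part is that the per-pair
hashes must be combined commutatively, since iteration order depends on
the table size:

// Summation is commutative, so the result is independent of bucket
// order; within a pair, key and value are combined order-sensitively.
size_t hashAA(K, V)(const V[K] aa)
{
    size_t h = 0;
    foreach (k, v; aa)
        h += typeid(K).getHash(&k) * 31 + typeid(V).getHash(&v);
    return h;
}

void main()
{
    int[int] x = [1: 10, 2: 20, 3: 30];  // built from a literal
    int[int] y;
    y[1] = 10; y[2] = 20; y[3] = 30;     // built element by element
    // Equal AA's hash equal, whatever their internal table sizes:
    assert(x == y && hashAA(x) == hashAA(y));
}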

Fixing (4) will require compile-time reflection... but may introduce
slowdowns? Is there a way to detect the easy cases (binary-hashable) so
that we only incur the slowdown in the complicated cases?
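
Perhaps something along these lines (a hypothetical sketch; deepHash is
a made-up name, and a real fix would presumably live in the
compiler-generated toHash or in druntime itself):

// Recurse over a struct's fields at compile time, deferring to
// TypeInfo for anything that has its own notion of hashing.
size_t deepHash(T)(ref const T val)
{
    static if (is(T == struct) && !__traits(hasMember, T, "toHash"))
    {
        size_t h = 0;
        foreach (ref f; val.tupleof)  // unrolled at compile time
            h = h * 31 + deepHash(f);
        return h;
    }
    else
    {
        // classes, arrays, AA's, basic types, structs with toHash:
        // defer to the (fixed) runtime TypeInfo.getHash
        return typeid(T).getHash(&val);
    }
}

Combined with the isBinaryHashable check above, the reflection-based
recursion would only kick in for the complicated cases.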


T

-- 
Music critic: "That's an imitation fugue!"

