[druntime] TypeInfo.getHash problems

Dmitry Olshansky dmitry.olsh at gmail.com
Thu Mar 1 08:32:18 PST 2012


On 01.03.2012 18:56, H. S. Teoh wrote:
> On Thu, Mar 01, 2012 at 12:36:31PM +0400, Dmitry Olshansky wrote:
>> On 01.03.2012 1:45, H. S. Teoh wrote:
>>> I found the following bugs with the current implementations of
>>> getHash in the various TypeInfo classes:
>>>
>>> 1) TypeInfo_Array.getHash doesn't work correctly with class objects
>>> with custom toHash() methods, or, for that matter, with arrays of
>>> AA's (which have their own problems--see (3a)). It just hashes the
>>> binary representation of the array.
>>>
>>> 2) TypeInfo_StaticArray.getHash doesn't suffer from (1), but it does
>>> suffer from another problem: it simply sums the hashes from each
>>> element, but this is prone to reordering hash collisions: an int[3]
>>> with elements [1,2,3] will have the same hash value as [3,2,1] or any
>>> reordering thereof.
>>>
>>> 3) TypeInfo_AssociativeArray.getHash is not implemented, so it just
>>> defaults to the base class getHash, which just does a hash of the
>>> binary representation of the AA. Unfortunately, this is fraught with
>>> problems:
>>>
>>> 	a) Two AA's with exactly the same key/value pairs may be binary
>>> 	unequal, because the underlying size of the hash table may be
>>> 	different (this happens when you initialize an AA with a literal
>>> 	vs. manually setting each element).
>>>
>>> 	b) If values are objects that implement toHash(), then the
>>> 	current implementation will have the wrong behaviour.
>>>
>>> 4) TypeInfo_Class.getHash will call the object's toHash method, if
>>> implemented, otherwise it hashes the binary representation of the
>>> object. But if the object has members that are references/pointers or
>>> other objects/structs that implement getHash, this will be wrong.
> [...]
>>> To fix (1) may cause noticeable slowdown when the key is an array of
>>> basic types, such as dstring (we'd have to compute the hash of each
>>> dchar, then compute the hash of the hashes).  Is there an easy way to
>>> tell whether or not an array's elements are binary-hashable? If so,
>>> we can still use hashing of the binary representation of the array
>>> for the simple cases (dstring, wstring, array of structs without
>>> nested refs/objects), and save the slower code for more complicated
>>> arrays.
>>>
>>
>> std.traits hasIndirections?
>
> This has to be fixed in druntime, though. Is there a druntime equivalent
> to std.traits?
>

The fact that std.traits belongs to Phobos doesn't mean it's impossible 
to smuggle a few artifacts into druntime :)
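
For illustration, here is a rough sketch of how such a trait could steer 
getHash between a fast binary path and a per-element path. hashArray and 
its strategy choice are hypothetical, not actual druntime code; it uses 
hasIndirections from std.traits and the seeded object.hashOf for brevity:

```d
import std.traits : hasIndirections;

/// Hypothetical helper: pick a hashing strategy at compile time.
size_t hashArray(T)(const(T)[] arr)
{
    static if (!hasIndirections!T)
    {
        // Fast path: elements carry no references, so hashing the raw
        // bytes of the array suffices (dstring, wstring, plain structs
        // without nested refs/objects).
        return hashOf(cast(const(void)[]) arr);
    }
    else
    {
        // Slow path: hash each element individually (honouring custom
        // toHash), chaining the seed so that element order matters.
        size_t h = 0;
        foreach (ref e; arr)
            h = hashOf(e, h);
        return h;
    }
}
```

Note that a struct with a custom toHash but no indirections would still 
take the fast path here, so a real fix would also need to detect 
user-defined toHash.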

[..]
>
>
>>> For (3), we can collect the hashes of the keys (already stored in the
>>> bucket) and the hashes of the values, and do a hash on them. (This
>>> seems to be the easiest one to fix?)
>>>
>> Or hash them as an array of pairs, whichever is faster.
>
> I did a quick implementation of this:
>
> 	https://github.com/D-Programming-Language/druntime/pull/171
>
> Basically, for each pair, hash the key's and value's hash values
> together, then sum the pair hashes over all pairs. The sum is
> commutative, so it avoids the problem of two AA's with identical
> contents but different internal hashtable sizes producing different
> pair orderings.
>

Looks like a clean solution.
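
The idea could be sketched roughly like this (hashAA is an illustrative 
name, not the code in the pull request; again the seeded hashOf is 
assumed for brevity):

```d
/// Illustrative sketch: hash each (key, value) pair, then combine the
/// pair hashes with a commutative sum, so the result does not depend on
/// bucket order or internal table size.
size_t hashAA(K, V)(V[K] aa)
{
    size_t h = 0;
    foreach (key, ref val; aa)
        h += hashOf(val, hashOf(key)); // key's hash seeds the value's
    return h;
}
```

Two AA's built in different insertion orders (and hence possibly with 
different internal layouts) then hash equal as long as they hold the 
same pairs.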

>
>>> To fix (4) will require compile-time reflection... but may introduce
>>> slowdowns? Is there a way to detect the easy cases (binary-hashable)
>>> so that we only incur slowdown on the complicated cases?
>>>
>>
>> see  point 1.
> [...]
>
> It would be good if there were a clean way to do this in druntime.
>
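
To make the reordering collision from point (2) above concrete, here is 
a small illustrative comparison of a commutative sum against a seeded 
chain (neither function is druntime code; the seeded hashOf is assumed):

```d
/// Summing element hashes is commutative, so element order is lost.
size_t sumHash(int[] a)
{
    size_t h = 0;
    foreach (e; a) h += hashOf(e);
    return h;
}

/// Chaining each element's hash through the seed keeps order.
size_t chainHash(int[] a)
{
    size_t h = 0;
    foreach (e; a) h = hashOf(e, h);
    return h;
}

unittest
{
    // Guaranteed collision: the sum is the same for any reordering.
    assert(sumHash([1, 2, 3]) == sumHash([3, 2, 1]));
    // The chained variant almost surely distinguishes the orderings.
    assert(chainHash([1, 2, 3]) != chainHash([3, 2, 1]));
}
```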



-- 
Dmitry Olshansky


More information about the Digitalmars-d mailing list