Performance of hashes and associative arrays

Dan dbdavidson at yahoo.com
Wed Nov 7 04:10:53 PST 2012


On Wednesday, 7 November 2012 at 06:38:32 UTC, Raphaël Jakse wrote:

> ==
>     override size_t toHash() const
>     {
>         return (typeid(firstName).getHash(&firstName) +
>                 typeid(lastName).getHash(&lastName));
>     }
> ==
>

Isn't the real problem the addition? Addition is commutative, so 
swapping the two fields yields the same hash. You want to mix the 
bits together in a consistent way that is order-sensitive. I've 
seen things like:

result = 37 * first + 17 * second
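
Applied to the quoted Student example, that multiplier scheme might look like the following sketch (the constants 37 and 17 are just arbitrary odd numbers, not anything from mix.d):

```d
// Sketch: an order-sensitive toHash for the quoted Student class.
// Multiplying each field hash by a different constant before adding
// makes ("Jean", "Martin") and ("Martin", "Jean") hash differently.
override size_t toHash() const
{
    size_t result = 37 * typeid(firstName).getHash(&firstName);
    result += 17 * typeid(lastName).getHash(&lastName);
    return result;
}
```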

I've incorporated this functionality in a toHash that can be 
mixed in. Any feedback on it would be great. See:

http://forum.dlang.org/post/tbjrqaogxegtyexnfags@forum.dlang.org
https://github.com/patefacio/d-help/blob/master/d-help/opmix/mix.d

It only supports structs, but the hashing could be made to 
support classes. It allows the following code for your example, 
providing toHash, opEquals and opCmp. Below, the hash codes for 
s1 and s2 are different, the instances are not equal, and s1 is 
less than s2, which is what you want.

import std.stdio;
import opmix.mix;

struct Student // represents a student
{
   mixin(HashSupport);
   string firstName; // the first name of the student
   string lastName; // the last name of the student
}

void main() {
   auto s1 = Student("Jean", "Martin");
   auto s2 = Student("Martin", "Jean");
   writeln(s1.toHash(), " vs ", s2.toHash());
   writeln(s1 == s2);
   writeln(s1 < s2);
}


> However, with this solution, we get the same hash for new 
> Student("Jean", "Martin") and new Student("Martin", "Jean"). We 
> did an experiment on the performance of associative arrays in D 
> when the hash function is not well designed (returning the same 
> hash for too many values). When the hash function returns the 
> same hash for too many values, performance is disastrous (see 
> the end of the post for more information).

This makes sense and is not particular to D.

> Ali agreed that concatenating strings each time would indeed be 
> inefficient. He thought we might cache the value (third 
> solution) :

Caching the hash code is interesting, and on large classes it 
could save you a lot. But isn't the signature shown const? Would 
that compile? Also, what if you change the strings? You would get 
a stale hash. I suppose you could limit writes to properties and 
null the cached hash code on any write.
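
A sketch of that caching scheme, assuming hypothetical property setters that invalidate the cache (none of these names come from the original code):

```d
import std.typecons : Nullable;

class Student
{
    private string firstName_;
    private Nullable!size_t cachedHash_; // empty until first toHash call

    @property void firstName(string value)
    {
        firstName_ = value;
        cachedHash_.nullify(); // any write invalidates the cached hash
    }

    override size_t toHash() const
    {
        // toHash is const, so lazily updating the cache needs a cast
        // (alternatively, recompute the hash eagerly on each write).
        auto self = cast(Student) this;
        if (self.cachedHash_.isNull)
            self.cachedHash_ = typeid(firstName_).getHash(&self.firstName_);
        return self.cachedHash_.get;
    }
}
```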

> Questions are :
>  - what is the most efficient solution, and in which case ?

Avoiding string concatenation is good. I think a single pass over 
all the important data (which in most cases is all the data) is 
the goal.

> This seems to be an extreme case, we might think results would 
> be completely different from a function giving hashes that 
> "sometimes" represents 2 elements.

Very true. If I increase to 10_000_000 I see opCmp:1913600 for 
the smart method and (several minutes later) opCmp:907216764 for 
the simple addition (method 1). In that case you know something 
special about the data and can take advantage of it. If I run the 
example using mixin(HashSupport) it does opCmp:7793499, which is 
about 4 times as many compares as the smart one. The simple 
addition does 474 times as many compares as the smart one, so it 
is clearly very bad.

So, if you know something special about the data, like that it 
can easily be converted into a single number such as seconds, by 
all means take advantage of that. But remember: the next time you 
add a field to the class, if you don't have some "automated" way 
of pulling in info from all the fields, you need to revisit the 
hash (and opCmp and opEquals).
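
One "automated" way to do that in D is to walk the fields with tupleof, so a newly added field is picked up without editing the hash by hand. A minimal sketch (not what mix.d actually does):

```d
// Sketch: order-sensitive hash over every field of a struct.
// Adding a field to T automatically changes the computed hash.
size_t hashAllFields(T)(ref const T t)
{
    size_t h;
    foreach (i, _; t.tupleof)
    {
        // multiply-then-add keeps the hash sensitive to field order
        h = h * 37 + typeid(t.tupleof[i]).getHash(&t.tupleof[i]);
    }
    return h;
}
```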

Thanks
Dan




More information about the Digitalmars-d-learn mailing list