Performance of hashes and associative arrays
Dan
dbdavidson at yahoo.com
Wed Nov 7 04:10:53 PST 2012
On Wednesday, 7 November 2012 at 06:38:32 UTC, Raphaël Jakse wrote:
> ==
> override size_t toHash() const
> {
>     return typeid(firstName).getHash(&firstName) +
>            typeid(lastName).getHash(&lastName);
> }
> ==
>
Isn't the real problem the addition? You want to mix the bits
together in a consistent way that is not commutative, so that
swapping the fields changes the hash. I've seen things like:
I've seen things like:
result = 37 * first + 17 * second
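A minimal sketch of that multiplicative combine in D (the constants 37 and 17 are illustrative, not from any particular library):

```d
import std.stdio;

// Combine two hash values in an order-sensitive way: because the
// multipliers differ, combineHashes(a, b) != combineHashes(b, a)
// in general, unlike plain addition.
size_t combineHashes(size_t first, size_t second)
{
    return 37 * first + 17 * second;
}

void main()
{
    writeln(combineHashes(1, 2)); // 37*1 + 17*2 = 71
    writeln(combineHashes(2, 1)); // 37*2 + 17*1 = 91
}
```

With plain addition both calls would return 3; here the order of the operands matters.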
I've incorporated this functionality into a toHash that can be
mixed in. Any feedback on it would be great. See:
http://forum.dlang.org/post/tbjrqaogxegtyexnfags@forum.dlang.org
https://github.com/patefacio/d-help/blob/master/d-help/opmix/mix.d
It only supports structs, but the hashing could be made to
support classes. It allows the following code for your example,
providing toHash, opEquals and opCmp. Below, the hash codes for
s1 and s2 are different, the instances are not equal, and s1 is
less than s2, which is what you want.
import std.stdio;
import opmix.mix;

struct Student // a struct representing a student
{
    mixin(HashSupport);
    string firstName; // the first name of the student
    string lastName;  // the last name of the student
}

void main()
{
    auto s1 = Student("Jean", "Martin");
    auto s2 = Student("Martin", "Jean");
    writeln(s1.toHash(), " vs ", s2.toHash());
    writeln(s1 == s2);
    writeln(s1 < s2);
}
> However, with this solution, we get the same hash for new
> Student("Jean", "Martin") and new Student("Martin", "Jean"). We
> did an experiment on the performance of associative arrays in D
> when the hash function is not well designed (returning the same
> hash for too many values). When the hash function returns the
> same hash for too many values, performance is disastrous (see
> the end of the post for more information).
This makes sense and is not particular to D.
> Ali agreed that concatenating strings each time would indeed be
> inefficient. He thought we might cache the value (third
> solution) :
Caching the hash code is interesting, and on large classes it
could save you a lot. But isn't the signature shown const? Would
that compile, given that caching requires mutating the instance?
Also, what if you change the strings? You would get a stale hash.
I suppose you could limit writes to properties and invalidate the
cached hash code on any write.
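A sketch of that property-based invalidation scheme (a hypothetical design, not the linked library's code; note that toHash cannot be const here precisely because it mutates the cache):

```d
import std.stdio;

// A struct that caches its hash and invalidates the cache whenever
// a field is written through its property setter.
struct CachedStudent
{
    private string _firstName;
    private string _lastName;
    private size_t _hash;
    private bool _hashValid;

    this(string first, string last)
    {
        _firstName = first;
        _lastName = last;
    }

    @property string firstName() const { return _firstName; }
    @property void firstName(string value)
    {
        _firstName = value;
        _hashValid = false; // any write invalidates the cached hash
    }

    // Not const: recomputing and storing the hash mutates the instance.
    size_t toHash()
    {
        if (!_hashValid)
        {
            _hash = 37 * typeid(_firstName).getHash(&_firstName)
                  + 17 * typeid(_lastName).getHash(&_lastName);
            _hashValid = true;
        }
        return _hash;
    }
}

void main()
{
    auto s = CachedStudent("Jean", "Martin");
    auto h1 = s.toHash();
    s.firstName = "Pierre"; // cache invalidated by the setter
    writeln(h1 != s.toHash());
}
```

The trade-off is one extra bool per instance and the loss of a const toHash; an alternative is to recompute eagerly in every setter.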
> Questions are :
> - what is the most efficient solution, and in which case ?
Avoiding string concatenation is good. I think a single pass over
all the important data (which in most cases is all the data) is
the goal.
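For instance, a single-pass sketch using FNV-1a (a well-known byte-at-a-time hash; this is my illustration, not the linked library's code): the bytes of each field are fed through the hash directly, so no temporary concatenated string is allocated, and a separator byte keeps ("ab", "c") from colliding with ("a", "bc").

```d
import std.stdio;

// FNV-1a inner loop: fold each byte into the running hash.
size_t fnv1a(size_t h, const(ubyte)[] data)
{
    foreach (b; data)
    {
        h ^= b;
        h *= 0x100000001b3UL; // 64-bit FNV prime
    }
    return h;
}

// Hash both fields in one pass, with a separator byte between them.
size_t studentHash(string firstName, string lastName)
{
    enum size_t basis = 0xcbf29ce484222325UL; // 64-bit FNV offset basis
    auto h = fnv1a(basis, cast(const(ubyte)[]) firstName);
    h = fnv1a(h, [cast(ubyte) 0]); // field separator
    return fnv1a(h, cast(const(ubyte)[]) lastName);
}

void main()
{
    // Field order matters, so the swapped names hash differently.
    writeln(studentHash("Jean", "Martin") != studentHash("Martin", "Jean"));
}
```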
> This seems to be an extreme case, we might think results would
> be completely different from a function giving hashes that
> "sometimes" represents 2 elements.
Very true. If I increase the count to 10_000_000, I see
opCmp:1913600 for the smart method and (several minutes later)
opCmp:907216764 for the simple addition (method 1).
In this case you know something special about the data and can
take advantage of it. If I run the example using
mixin(HashSupport), it does opCmp:7793499, which is about 4 times
as many compares as the smart one. The simple addition does 474
times as many compares as the smart one, so it is clearly very
bad. So, if you know something special about the data, such as
that it can easily be converted into a single number like
seconds, by all means take advantage of that. But remember: the
next time you add a field to the class, if you don't have some
"automated" way of pulling in info from all fields, you need to
revisit the hash (and opCmp and opEquals).
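One way to get that "automated" behavior (my sketch, not the mixin's actual implementation) is to walk every field with D's compile-time reflection via .tupleof, so a newly added field is picked up without touching the hash code:

```d
import std.stdio;

// Hash every field of a struct via compile-time reflection.
// Adding a field to T automatically feeds it into the hash.
size_t hashAllFields(T)(T value)
{
    size_t h = 17;
    foreach (ref field; value.tupleof)
    {
        // Order-sensitive combine of each field's own hash.
        h = 37 * h + typeid(typeof(field)).getHash(&field);
    }
    return h;
}

struct Student
{
    string firstName;
    string lastName;
}

void main()
{
    auto s1 = Student("Jean", "Martin");
    auto s2 = Student("Martin", "Jean");
    // Field order matters, so the swapped names hash differently.
    writeln(hashAllFields(s1) != hashAllFields(s2));
}
```

The same foreach-over-.tupleof pattern can generate opEquals and opCmp, which is presumably what makes a mixin like HashSupport attractive.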
Thanks
Dan