Performance of hashes and associative arrays

"Raphaël "Raphaël
Tue Nov 6 22:38:30 PST 2012


Hello everybody,

we had a conversation with Ali Cehreli about the right way of 
hashing values / objects. This came up in the context of D's 
associative arrays, but it applies to any language as well.

One of the points was the following: we have this class:

==
class Student // class representing a student
{
     string firstName; // the first name of the student
     string lastName; // the last name of the student

     this(string firstName, string lastName)
     {
         this.firstName = firstName;
         this.lastName = lastName;
     }
}
==

Let s be a Student:

==
auto s = new Student("Jean", "Martin");
==

We want to be able to get the hash of s, so we override the 
toHash method of the Student class:

==
class Student
{
     string firstName;
     string lastName;

     override size_t toHash() const
     {
         //...
     }
}
==

The original solution proposed was:

==
     override size_t toHash() const
     {
         return (typeid(firstName).getHash(&firstName) +
                 typeid(lastName).getHash(&lastName));
     }
==

However, with this solution, we get the same hash for new 
Student("Jean", "Martin") and new Student("Martin", "Jean"). We 
ran an experiment on the performance of associative arrays in D 
when the hash function is badly designed, i.e. returns the same 
hash for too many values; in that case performance degrades 
dramatically (see the end of the post for details).
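
The collision is easy to observe (a minimal sketch, assuming the 
Student class above with its constructor and with the first, 
additive toHash in place):

==
void main()
{
    auto a = new Student("Jean", "Martin");
    auto b = new Student("Martin", "Jean");
    assert(a.toHash() == b.toHash()); // collision: addition is commutative
}
==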

So here is the second solution that was proposed, which avoids 
giving the same hash to two different names:

==
     override size_t toHash() const
     {
         // we assume that ":" is not a valid character in a name
         auto h = firstName ~ ":" ~ lastName;
         return typeid(h).getHash(&h);
     }
==

A ":" was added so new Student("Martin", "Jean") gets a different 
hash than new Student("Mar", "tinJean")).
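
A small check shows why (hypothetical names, just for illustration):

==
void main()
{
    // with the separator, the two keys differ...
    assert(("Martin" ~ ":" ~ "Jean") != ("Mar" ~ ":" ~ "tinJean"));
    // ...without it, they are the same string, hence the same hash
    assert(("Martin" ~ "Jean") == ("Mar" ~ "tinJean"));
}
==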
Let's call the following variant, without the separator, the 
"second bis" solution:

==
     override size_t toHash() const
     {
         auto h = firstName ~ lastName;
         return typeid(h).getHash(&h);
     }
==

This variant assumes that being named "Mar tinJean" instead of 
"Martin Jean" is really strange and therefore quite unusual.
However, both of these solutions imply a string concatenation 
(and thus an allocation) every time the hash is needed.
Ali agreed that concatenating strings on each call would indeed 
be inefficient, and suggested caching the value (third solution):

==
class Student
{
     string firstName;
     string lastName;
     size_t cachedHash; // 0 doubles as "not computed yet"

     override size_t toHash() const
     {
         if (cachedHash == 0)
         {
             auto h = firstName ~ ":" ~ lastName;
             // cast away const to store the lazily computed hash
             (cast(Student) this).cachedHash = typeid(h).getHash(&h);
         }
         return cachedHash;
     }
}
==

This solution doesn't compromise on the "quality" of the hash and 
is fast after the first use, but it has to keep the hash as part 
of the class. The hash is still computed once for each new object 
used as a key in an associative array, even if the hash for that 
specific name has already been computed in another object.
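
A fourth option, not raised in our exchange but worth comparing: 
mix the two field hashes asymmetrically instead of concatenating, 
which avoids both the allocation and the cached field (a sketch; 
the multiplier 31 is just a conventional choice):

==
     override size_t toHash() const
     {
         // mix the field hashes so that swapped names hash differently,
         // without allocating a new string
         immutable h1 = typeid(firstName).getHash(&firstName);
         immutable h2 = typeid(lastName).getHash(&lastName);
         return h1 * 31 + h2;
     }
==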

The questions are:
  - What is the most efficient solution, and in which cases?
  - Is it preferable to try to make the hash unique (second bis 
solution), or to make it easier to evaluate and risk, in rare 
cases, getting the same hash for two different objects (second 
solution)?
  - Is solution 3 good practice? Is it preferable to the other 
solutions? When?
  - Is solution 1 preferable to the other solutions, given that 
first name / last name inversions are quite "rare"?
  - How does solution 1 compare to solution 2 bis?

It would also be interesting to have ideas for the general case, 
i.e. arbitrary objects rather than names with particularities 
like "first names are rarely last names and vice versa", or for 
other specific situations; one generic pattern is sketched below.
The questions are oriented towards associative arrays in D, but 
they apply to other uses of hashes as well.
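
For arbitrary objects, one widely used pattern is to mix the 
field hashes sequentially in an order-sensitive way (modeled on 
hash_combine from C++'s Boost; this sketch is mine, not something 
from the discussion):

==
// order-sensitive mixing of several field hashes
size_t hashCombine(size_t seed, size_t h)
{
    return seed ^ (h + 0x9e3779b9 + (seed << 6) + (seed >> 2));
}

// usage inside a hypothetical toHash:
// size_t seed = 0;
// seed = hashCombine(seed, typeid(field1).getHash(&field1));
// seed = hashCombine(seed, typeid(field2).getHash(&field2));
// return seed;
==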


For those who are interested, here is Ali's experiment on the 
performance of associative arrays in D depending on the design of 
the hash function (I took the liberty of modifying it slightly).

==
import std.stdio;
import std.random;
import std.string;

class Clock
{
     int hour;
     int minute;
     int second;

     static size_t opCmpCount = 0;
     static size_t opEqualsCount = 0;
     static size_t toHashCount = 0;

     override equals_t opEquals(Object o) const
     {
         ++opEqualsCount;

         auto rhs = cast(const Clock)o;

         return (rhs &&
                 (hour == rhs.hour) &&
                 (minute == rhs.minute) &&
                 (second == rhs.second));
     }

     override int opCmp(Object o) const
     {
         ++opCmpCount;

         /* Taking advantage of the automatically-maintained
          * order of the types. */
         if (typeid(this) != typeid(o)) {
             return typeid(this).opCmp(typeid(o));
         }

         auto rhs = cast(const Clock)o;
         /* No need to check whether rhs is null, because it is
          * known on this line that it has the same type as o. */

         return (hour != rhs.hour
                 ? hour - rhs.hour
                 : (minute != rhs.minute
                    ? minute - rhs.minute
                    : second - rhs.second));
     }

     override string toString() const
     {
         return format("%02s:%02s:%02s", hour, minute, second);
     }

     this(int hour, int minute, int second)
     {
         this.hour = hour;
         this.minute = minute;
         this.second = second;
     }

     override hash_t toHash() const
     {
         ++toHashCount;
         return hour + minute + second; // method 1
         // return hour + 60 * minute + 3600 * second; // method 2
     }

     static void printCounts()
     {
         writefln("opEquals: %8s", opEqualsCount);
         writefln("opCmp   : %8s", opCmpCount);
         writefln("toHash  : %8s", toHashCount);
     }
}

Clock randomClock()
{
     return new Clock(uniform(0, 24), uniform(0, 60), uniform(0, 60));
}

void main()
{
     enum count = 10_000;
     Clock[Clock] aa;

     foreach (i; 0 .. count) {
         auto entry = randomClock();
         aa[entry] = entry;
     }

     foreach (i; 0 .. count) {
         auto result = (randomClock() in aa);
     }

     Clock.printCounts();
}
==

We both get the following results for method 1:

opEquals:        0
opCmp   :  1438438
toHash  :    20000


and the following results for method 2:

opEquals:        0
opCmp   :     1681
toHash  :    20000

Apart from the fact that the associative arrays of gdc and dmd 
don't seem to use the opEquals method here, we see that method 1 
implies approx. 856 times more comparisons than method 2. Method 
1 can only produce 142 distinct hashes (the sum ranges from 0 to 
23 + 59 + 59 = 141) for the 24 * 60 * 60 = 86400 possible clock 
values, so each hash stands for roughly 608 clocks on average. 
Method 2 gives exactly one hash per clock value.
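
Those counts can be verified by brute force over the whole domain 
(a small standalone sketch):

==
import std.stdio;

void main()
{
    bool[size_t] method1, method2;
    foreach (h; 0 .. 24)
        foreach (m; 0 .. 60)
            foreach (s; 0 .. 60)
            {
                method1[h + m + s] = true;
                method2[h + 60 * m + 3600 * s] = true;
            }
    writeln(method1.length); // 142 distinct hashes for 86400 clocks
    writeln(method2.length); // 86400: one hash per clock value
}
==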
This is admittedly an extreme case; the results might look quite 
different with a hash function that only "sometimes" maps two 
elements to the same hash.

