Superhash buried in Druntime does super work.
Michael Rynn
michaelrynn at optusnet.com.au
Wed Jul 28 07:14:40 PDT 2010
A little while ago someone here mentioned the superhash and was curious about
it, and I found it hidden in the D runtime library. The SuperHash is published
online in "Hash Functions" by Paul Hsieh:
http://www.azillionmonkeys.com/qed/hash.htm
Since I had an almost ready-made hash distribution and benchmark test in
dsource/projects/aa, and I somehow came to look at it again, I tried it out.
I compared three hashing functions for a test class "WrapString", which wraps
a D2 string (a minimal sketch of the simple multiplier-style hash follows this
list):
"TypeInfo.getHash" - the simple string hash used by D (inside version(all) in
the runtime; multiplier 11).
"superHash" - my transcription of the Hsieh string function above into D. I
then noticed that D uses this very same function in hashOf() in the module
rt.util.hash.
"Java hash" - the simple string hash as used by Java (multiplier 31).
I also varied whether the hash was pre-calculated and cached, or recalculated
on each toHash() call, to get an indication of the calculation overhead
difference.
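
As a rough illustration of the cached variant, something along these lines.
This is a sketch under my own assumptions, not the benchmark's actual
WrapString source, and the exact toHash signature expected by Object has
varied between D versions:

// Sketch of a string-wrapping key class with a cached hash. The hash is
// computed once in the constructor; toHash() then just returns it. The
// uncached variant instead redoes the loop on every toHash() call.
class WrapString
{
    string s;
    size_t cachedHash;

    this(string str)
    {
        s = str;
        size_t h = 0;
        foreach (char c; s)
            h = h * 31 + c;       // Java-style multiplier hash, computed once
        cachedHash = h;
    }

    override size_t toHash() @trusted nothrow
    {
        return cachedHash;        // cached: no per-call recalculation
    }

    override bool opEquals(Object o)
    {
        auto w = cast(WrapString) o;
        return w !is null && w.s == s;
    }
}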
The benchmark was compiled in release mode with the optimize flag.
With hash functions, the distribution of the data makes a difference, so
randomly constructed data behaves differently from more 'patterned' data.
Patterned data might be expected to perform worse with not-so-good hash
functions. Each run of the benchmark program used the same stored randomized
data set of 1,000,000 valid strings, each of random length between 1 and 20
printable random characters distributed uniformly between 0x20 and 0x7E.
hashOf was passed the string length as a hash seed value. So this data is
perhaps not patterned enough to show big differences.
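
For anyone wanting to reproduce something similar, the data set can be built
along these lines. This is a sketch under my own assumptions; the seed value
and function name are made up, not taken from the benchmark code:

import std.random : Random, uniform;

// Build random printable strings as described above: random lengths of
// 1 to 20, characters uniform in 0x20 .. 0x7E.
string[] makeTestData(size_t count = 1_000_000)
{
    auto rng = Random(12345);            // fixed seed so the same set can be reused across runs
    auto data = new string[count];
    foreach (ref s; data)
    {
        auto len = uniform(1, 21, rng);  // length 1 to 20 inclusive
        auto buf = new char[len];
        foreach (ref c; buf)
            c = cast(char) uniform(0x20, 0x7F, rng);  // printable, 0x20 to 0x7E inclusive
        s = buf.idup;
    }
    return data;
}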
The distribution of bucket chain lengths (% of buckets):

chain length   TypeInfo.getHash (11)   Java hash (31)   SuperHash (hashOf)
      1               50.90                52.07              52.97
      2               31.03                33.02              33.70
      3               10.11                11.23              10.64
      4                2.89                 2.91               2.25
      5                1.41                 0.62               0.39
      6                1.02                 0.14               0.04
      7                0.72                 0.02               0.00
      8                0.66                 0.00               0.00
      9                0.67
     10                0.38
     11                0.13
     12                0.04
     13                0.01
     14                0.00
Now for some hashtable timings.
Using a cached hash calculated by... (average of 10 runs / standard deviation):

            TypeInfo.getHash (11)   Java hash (31)   SuperHash (hashOf)
insert           2.36 / 0.113         2.50 / 0.173       2.37 / 0.089
lookup           3.19 / 0.232         2.91 / 0.087       2.73 / 0.107
The super hash gives the most random distribution of buckets and the best
lookup times. The simple string hash using 31 does better than the one using
11. The differences are not very big: insertion times are no different, but
the distribution seems to affect lookups in the expected direction, and this
replicates well enough.
Now see how hash recalculation affects the results. Here a hash calculation is
done once for each insertion and each lookup:
            TypeInfo.getHash (11)   Java hash (31)   SuperHash (hashOf)
insert           2.98 / 0.080         3.11 / 0.212       3.49 / 0.184
lookup           4.12 / 0.094         4.24 / 0.336       4.26 / 0.114
The difference in times between getHash (11) and the Java hash (31) does not
seem too significant. Maybe multiplying by a bigger number takes a bit longer,
and the super hashing takes a bit longer still.
If I substituted 11 for 31 in the WrapString class toHash function, the
times for 11 and 31 were equivalent. My SuperHash transcription also
performed much the same as the rt.util version.
For this test data set, there is not much of a time difference to choose
between the hash functions. The SuperHash (hashOf) does more work and would be
expected to take relatively longer for larger slabs of data. On today's CPUs,
most of the time is probably just spent fetching each character of the string
from memory, before starting to hash the bits around.
All the same, for the string hash on this data set, using 31 instead of 11 as
the hash multiplier seems to be a benefit for bucket distribution without a
speed penalty, with almost the same code and a reasonable hash time. If anyone
can find or describe a data set on which hash 31 performs worse than hash 11,
please let me know.
Back to whatever I was doing before now..