Superhash buried in Druntime does super work.
Michael Rynn
michaelrynn at optusnet.com.au
Wed Jul 28 07:14:40 PDT 2010
A little while ago someone here mentioned the superhash and was curious about
it, and I found it hidden in the D runtime library. The SuperHash is published
online in "Hash Functions" by Paul Hsieh:
http://www.azillionmonkeys.com/qed/hash.htm
Since I had an almost ready-made hash distribution and benchmark test in
dsource/projects/aa, and I somehow came to look at it again, I tried it out.
I compared three hashing functions for a test class "WrapString", which wraps
a D2 string (a minimal sketch of the simple multiplier-style hash follows this
list):
"TypeInfo.getHash" - the simple string hash used by D (inside version(all) in
the runtime; multiplier 11).
"superHash" - my transcription of the Hsieh string function above into D. I
then noticed that D uses this very same function in hashOf() in the module
rt.util.hash.
"Java hash" - the simple string hash as used by Java (multiplier 31).
I also varied whether the hash was pre-calculated and cached, or recalculated
on each toHash() call, to get an indication of the calculation overhead
difference.
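
As a rough illustration of the cached variant, something along these lines.
This is a sketch under my own assumptions, not the benchmark's actual
WrapString source, and the exact toHash signature expected by Object has
varied between D versions:

// Sketch of a string-wrapping key class with a cached hash. The hash is
// computed once in the constructor; toHash() then just returns it. The
// uncached variant instead redoes the loop on every toHash() call.
class WrapString
{
    string s;
    size_t cachedHash;

    this(string str)
    {
        s = str;
        size_t h = 0;
        foreach (char c; s)
            h = h * 31 + c;       // Java-style multiplier hash, computed once
        cachedHash = h;
    }

    override size_t toHash() @trusted nothrow
    {
        return cachedHash;        // cached: no per-call recalculation
    }

    override bool opEquals(Object o)
    {
        auto w = cast(WrapString) o;
        return w !is null && w.s == s;
    }
}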
The benchmark was compiled in release mode with the optimize flag.
With hash functions, the distribution of the data makes a difference, so
randomly constructed data behaves differently from more 'patterned' data.
Patterned data might be expected to perform worse with not-so-good hash
functions. Each run of the benchmark program used the same stored randomized
data set of 1,000,000 valid strings, each of random length between 1 and 20
printable random characters distributed uniformly between 0x20 and 0x7E.
hashOf was passed the string length as a hash seed value. So this data is
perhaps not patterned enough to show big differences.
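
For anyone wanting to reproduce something similar, the data set can be built
along these lines. This is a sketch under my own assumptions; the seed value
and function name are made up, not taken from the benchmark code:

import std.random : Random, uniform;

// Build random printable strings as described above: random lengths of
// 1 to 20, characters uniform in 0x20 .. 0x7E.
string[] makeTestData(size_t count = 1_000_000)
{
    auto rng = Random(12345);            // fixed seed so the same set can be reused across runs
    auto data = new string[count];
    foreach (ref s; data)
    {
        auto len = uniform(1, 21, rng);  // length 1 to 20 inclusive
        auto buf = new char[len];
        foreach (ref c; buf)
            c = cast(char) uniform(0x20, 0x7F, rng);  // printable, 0x20 to 0x7E inclusive
        s = buf.idup;
    }
    return data;
}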
The distribution of bucket chain lengths (% of buckets):

chain length   TypeInfo.getHash (11)   Java hash (31)   SuperHash (hashOf)
      1               50.90                52.07              52.97
      2               31.03                33.02              33.70
      3               10.11                11.23              10.64
      4                2.89                 2.91               2.25
      5                1.41                 0.62               0.39
      6                1.02                 0.14               0.04
      7                0.72                 0.02               0.00
      8                0.66                 0.00               0.00
      9                0.67
     10                0.38
     11                0.13
     12                0.04
     13                0.01
     14                0.00
Now for some hashtable timings.
Using a cached hash calculated by... (average of 10 runs / standard deviation):

            TypeInfo.getHash (11)   Java hash (31)   SuperHash (hashOf)
insert           2.36 / 0.113         2.50 / 0.173       2.37 / 0.089
lookup           3.19 / 0.232         2.91 / 0.087       2.73 / 0.107
The super hash gives the most random distribution of buckets and the best
lookup times. The simple string hash using 31 does better than the one using
11. The differences are not very big: insertion times are no different, but
the distribution seems to affect lookups in the expected direction, and this
replicates well enough.
Now see how hash recalculation affects the results. Here a hash calculation is
done once for each insertion and each lookup:
            TypeInfo.getHash (11)   Java hash (31)   SuperHash (hashOf)
insert           2.98 / 0.080         3.11 / 0.212       3.49 / 0.184
lookup           4.12 / 0.094         4.24 / 0.336       4.26 / 0.114
The difference in times between getHash (11) and the Java hash (31) does not
seem too significant. Maybe multiplying by a bigger number takes a bit longer,
and the super hashing takes a bit longer still.
If I substituted 11 for 31 in the WrapString class toHash function, the
times for 11 and 31 were equivalent. My SuperHash transcription also
performed much the same as the rt.util version.
For this test data set, there is not much of a time difference to choose
between the hash functions. The SuperHash (hashOf) does more work and would be
expected to take relatively longer for larger slabs of data. On today's CPUs,
most of the time is probably just spent fetching each character of the string
from memory, before starting to hash the bits around.
All the same, for the string hash on this data set, using 31 instead of 11 as
the hash multiplier seems to be a benefit for bucket distribution without a
speed penalty, with almost the same code and a reasonable hash time. If anyone
can find or describe a data set on which hash 31 performs worse than hash 11,
please let me know.
Back to whatever I was doing before now..