Absolutely horrible default string hashing

dsimcha dsimcha at yahoo.com
Sat May 2 10:00:50 PDT 2009


I've been playing around with associative array implementations in D, mostly
trying to create ones that are more GC-friendly.  I had written one that
seemed much faster than the builtin for int and float keys but much slower for
strings.  I've found the reason:  The builtin AAs use trees for collision
resolution, while mine were using probing or chaining, and with strings there
were a heck of a lot of collisions because the default string hashing
algorithm is absolutely horrible.  Here's a small test program to demonstrate
this:

import std.stdio, std.random, std.algorithm, std.range;

void main() {
    uint[hash_t] hashFrq;
    bool[string] alreadyUsed;

    // Generate 100k random strings, compute their hashes and count the
    // frequency of each hash.
    foreach(i; 0..100_000) {
        string myRandString;

        // Make sure that no two strings are equal.
        do {
            myRandString = randString();
        } while(myRandString in alreadyUsed);
        alreadyUsed[myRandString] = true;

        hash_t hash = typeid(string).getHash(cast(void*) &myRandString);
        hashFrq[hash]++;
    }

    auto hashes = hashFrq.keys;
    auto counts = hashFrq.values;
    // Sort the two arrays in lockstep by count; zip's element pairs are
    // writable, so sorting the zip reorders both underlying arrays.
    sort!((a, b) => a[0] < b[0])(zip(counts, hashes));
    foreach(i; 0..hashes.length) {
        writeln(hashes[i], "\t", counts[i]);
    }

    writeln(hashes.length, " unique hashes used.");
}


// Generates a random string of geometrically distributed length composed
// of uniformly distributed characters in [A-Z, a-z].  Each character is
// followed by another with probability 0.95, so the expected length is
// 1 / 0.05 = 20 characters.
string randString() {
    bool upperCase;
    bool keepGoing = true;
    string ret;

    uint upperA = cast(uint) 'A';
    uint upperZ = cast(uint) 'Z' + 1;
    uint lowerA = cast(uint) 'a';
    uint lowerZ = cast(uint) 'z' + 1;

    while(keepGoing) {
        upperCase = cast(bool) uniform!"[]"(0, 1);
        ret ~= cast(char)
            (upperCase ? uniform(upperA, upperZ) : uniform(lowerA, lowerZ));
        keepGoing = uniform(0.0, 1.0) > 0.05;
    }
    return ret;
}

The result is that only about 8400 unique hashes are used, meaning that over
90% of the keys cannot be stored directly in the position corresponding to
their hash.  Note that this is in the full hash_t space, not in the modulus
space typically used for AAs, so the real results are probably even worse.
If the hashes were uniformly distributed across all possible values of
hash_t, nearly every key would get its own hash:  the standard occupancy
expectation m * (1 - (1 - 1/m)^n) works out to about 99998.8 unique hashes
for n = 100,000 keys in a 32-bit hash space (m = 2^32), i.e. only about one
expected collision.  (The exact figure doesn't really matter for the
take-home message anyhow.)
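
For reference, that uniform baseline can be computed directly.  Here is a
minimal sketch; it assumes a 32-bit hash_t, and with a 64-bit hash_t the
expected number of collisions is essentially zero:

import std.math, std.stdio;

void main() {
    immutable double n = 100_000;    // number of distinct keys
    immutable double m = 2.0 ^^ 32;  // size of a 32-bit hash space

    // Expected number of distinct hashes when n keys are hashed
    // uniformly into m buckets (the classic occupancy problem).
    immutable expectedUnique = m * (1.0 - (1.0 - 1.0 / m) ^^ n);

    writefln("Expected unique hashes: %.1f", expectedUnique);
    writefln("Expected collisions:    %.1f", n - expectedUnique);
}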

Using trees instead of a simpler (worst-case O(N)) mechanism such as probing
or chaining to resolve collisions is very costly:  the resulting table
cannot be iterated over in O(1) auxiliary space, and the tree nodes require
extra space overhead to maintain, meaning more GC pressure, etc.  If we want
to fix D's AAs, we first need to fix D's string hashing.
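
In the meantime, one workaround is to wrap string keys in a struct with a
custom toHash.  Below is a minimal sketch using FNV-1a (32-bit variant), a
well-known general-purpose string hash; it's just one reasonable candidate,
not necessarily what the builtin should adopt, and the names fnv1a and
HashedString are purely for illustration:

import std.stdio;

// FNV-1a, 32-bit variant.  The constants are the standard FNV offset
// basis and FNV prime; each input byte is xor'd in and then mixed
// through a multiply, so similar strings scatter instead of clustering.
hash_t fnv1a(string s) pure nothrow @safe {
    uint h = 2_166_136_261u;
    foreach (char c; s) {
        h ^= c;
        h *= 16_777_619u;
    }
    return h;
}

// Wrapper key type that replaces the default string hash.
struct HashedString {
    string s;

    hash_t toHash() const pure nothrow @safe {
        return fnv1a(s);
    }

    bool opEquals(const HashedString other) const pure nothrow @safe {
        return s == other.s;
    }
}

void main() {
    uint[HashedString] counts;
    counts[HashedString("foo")]++;
    counts[HashedString("bar")]++;
    counts[HashedString("foo")]++;
    writeln(counts[HashedString("foo")]);  // prints 2
}

Plugging fnv1a into the frequency test above, in place of
typeid(string).getHash, is an easy way to compare its distribution against
the builtin's.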


