[Issue 2922] New: Egregiously bad hashing performance with strings

d-bugmail at puremagic.com
Sat May 2 10:42:10 PDT 2009


http://d.puremagic.com/issues/show_bug.cgi?id=2922

           Summary: Egregiously bad hashing performance with strings
           Product: D
           Version: 2.029
          Platform: PC
        OS/Version: Windows
            Status: NEW
          Keywords: performance
          Severity: normal
          Priority: P2
         Component: DMD
        AssignedTo: bugzilla at digitalmars.com
        ReportedBy: dsimcha at yahoo.com


The current hashing scheme for strings causes massive numbers of hash
collisions unnecessarily.  Here's a small program that demonstrates the
issue: it generates distinct random strings, hashes them, and counts how
often each hash value occurs.

import std.stdio, std.random, std.algorithm, std.range;

void main() {
    uint[hash_t] hashFrq;
    bool[string] alreadyUsed;

    // Generate 100k random strings, compute their hashes and count the
    // frequency of each hash.
    foreach(i; 0..100_000) {
        string myRandString;

        // Make sure that no two strings are equal.
        do {
            myRandString = randString();
        } while(myRandString in alreadyUsed);
        alreadyUsed[myRandString] = true;

        // Hash the string the same way the runtime's AAs do, through its
        // TypeInfo.
        hash_t hash = typeid(string).getHash(cast(void*) &myRandString);
        hashFrq[hash]++;
    }

    auto hashes = hashFrq.keys;
    auto counts = hashFrq.values;
    // Sort hashes and counts in lockstep, ordered by frequency.
    sort!((a, b) { return a[0] < b[0]; })(zip(counts, hashes));
    foreach(i; 0..hashes.length) {
        writeln(hashes[i], "\t", counts[i]);
    }

    writeln(hashes.length, " unique hashes used.");
}


// Generates a random string of geometrically distributed length composed
// of uniformly distributed characters in [A-Z, a-z].  Expected length is 20.
string randString() {
    bool upperCase;
    bool keepGoing = true;
    string ret;

    uint upperA = cast(uint) 'A';
    uint upperZ = cast(uint) 'Z' + 1;
    uint lowerA = cast(uint) 'a';
    uint lowerZ = cast(uint) 'z' + 1;

    while(keepGoing) {
        upperCase = cast(bool) uniform!"[]"(0, 1);
        ret ~= cast(char)
            (upperCase ? uniform(upperA, upperZ) : uniform(lowerA, lowerZ));
        keepGoing = uniform(0.0, 1.0) > 0.05;
    }
    return ret;
}
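
(To reproduce, save the whole listing as, say, collisions.d, and run it
with "dmd -run collisions.d"; the file name is arbitrary.)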

The result is that only about 8400 unique hashes are used, meaning that over
90% of the keys cannot be stored directly in the position corresponding to
their hash.  Note that this is in the full hash_t space, not in the modulus
space typically used for AAs, so the real results are probably even worse.
If the hashes were uniformly distributed across all possible values of
hash_t, one would expect only about 30 collisions, i.e. about 99970 unique
hashes.  (This is based on Monte Carlo simulation, not exact combinatorics,
so it's only a rough figure, but that doesn't really matter for the
take-home message anyhow.)
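
For reference, here is a minimal sketch of such a Monte Carlo baseline (an
illustration, not the original simulation; it assumes a 32-bit hash_t, as on
the reported platform):

import std.random, std.stdio;

void main() {
    // Draw 100_000 values uniformly from the full 32-bit hash space and
    // count how many distinct ones come out (the classic birthday problem).
    enum n = 100_000;
    bool[uint] seen;
    foreach(i; 0..n)
        seen[uniform!uint()] = true;
    writeln(seen.length, " unique hashes out of ", n);
}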

It appears that the current hashing scheme simply adds up the integer
character codes of each byte in the array and returns the sum as the hash.
This is the default implementation for hashing an array.  Even something
simple like rotating the hash value before each add would be a significant
improvement, as the sketch below illustrates.
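
Here is such a sketch (hypothetical code, not the actual runtime
implementation), contrasting the additive scheme with a rotate-before-add
variant.  Note that the additive hash maps every permutation of a string to
the same value:

import std.stdio;

// The scheme described above: sum the character codes.  Every permutation
// of a string ("abc", "cba", ...) hashes identically, and random strings
// of expected length 20 drawn from [A-Za-z] land in a narrow band of sums.
hash_t additiveHash(string s) {
    hash_t h = 0;
    foreach(c; s)
        h += c;
    return h;
}

// The suggested improvement: rotate the accumulated hash before each add so
// that a character's position influences the result.  The rotate amount (5)
// is an arbitrary illustrative choice.
hash_t rotatingHash(string s) {
    hash_t h = 0;
    foreach(c; s) {
        h = (h << 5) | (h >> (8 * hash_t.sizeof - 5));
        h += c;
    }
    return h;
}

void main() {
    writeln(additiveHash("abc") == additiveHash("cba"));  // true: collision
    writeln(rotatingHash("abc") == rotatingHash("cba"));  // false
}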

