Less coding

Mon Feb 25 18:01:35 PST 2008

D is a system language, so probably most of you are quite interested in fast running programs. I understand that some people here may not appreciate or be interested in some of the qualities of the scripting languages (as Python and the like). But I can show a little example. It's an extreme example, uncommon, and you may think I have cheated a bit, so you don't have to take it too much seriously, but it looks nice and interesting.

I have seen this C code:
http://nickmudge.info/c/hashtest.c.txt

It tests how many collisions are produced by seven different hash functions. It doesn't test their running time (and the demo strings are "wrong", because they are random, while it's better to test them on some real text), but for me it was interesting enough to do a D translation. 
It generates an array of distinct random strings of random length, computes hash values for them, and counts how many repeated results there are. It contains functions like:

linear_array_search
check_key_duplicates
get_random_character
make_random_strings

The original code tests only hash7, but I have added tests for all of them. Those four functions, plus the main() require about 80 lines (almost 95 if you don't use function pointers to shorten the code that calls the various hashing functions).

Using my d libs I have translated those four functions plus the main like this:

void main() {
    fastSeed = 10;

    string chars = letters ~ digits;
    int[string] tests;
    do {
        auto txt = map((int i){ return fastChoice(chars);}, xrange(fastRandInt(1, 24)));
        tests[txt] = 1;
    } while (tests.length < 10000);

    foreach(nf, func; [&hash2, &hash3, &hash4, &hash5, &hash6, &hash7]) {
        auto keys = fromKeys(xmap(func, tests), 0);
        putr("hash", nf+2, ": ", tests.length - keys.length);
    }
}

Explanations of the code:
- map() takes a function, and an one or more iterable(s), and it return an array of the function applied to every item of the iterable. If the iterable has a length attribute, it pre-allocates the result, so with a bit of inlining it's often almost as fast as a normal loop.
- xrange() return an iterable object, it just returns successive numbers in an interval, with a step too.
- randInt is fast enough, and it gives an integer in the closed integer interval. The seed is set to the current time() by default, but you may set the seed too.
- xmap() is like map, but like all the x-something it returns a lazy iterable object, that yields the items mapped by the given function.
- fromKeys() takes an iterable and a constant, and it returns an associative array with constant keys (it's similar to the useful fromkeys() method of python dicts. I hope D AAs will grow some useful methods).

The following code is for an unreleased yet version of my d libs, I am writing a set()/Set(), that is a class (and function helper) that just wraps D AAs to give a useful set semantics with many simple but useful methods:

void main() {
    string chars = letters ~ digits;
    auto tests = Set!(int);
    do {
        auto txt = map((int i){ return choice(chars);}, xrange(randInt(1, 24)));
        tests.add(txt);
    } while (tests.length < 50000);

    foreach(nf, func; [&hash2, &hash3, &hash4, &hash5, &hash6, &hash7])
        putr("hash", nf+2, ": ", tests.length - set(xmap(func, tests)).length);
}

Notes:
- On my PC with DMD (the version of the code without set()) the creation of the tests[] requires about 1 second with n=50_000. I have seen that using a lower level syle of coding it can be lowered to about 0.5 seconds (in most cases the speed loss is lower than 50%), but it's just initialization time, it runs only once, so I don't care much for for the extra half second. Losing that half second I gain a shorter code, that's faster to write and debug.
- The original C version uses linear scans, instead of the D hashes, so it's much slower, on my PC the D version is 15 times faster than the C one, with N = 10000, compiled with MinGW 4.2.1 (-O3) and the latest DMD.
- The faster Python version (with the Psyco JIT and array.array) needs about 3.5 seconds to generate the test data alone, with n=50000 still.
- I have removed the hash1 from all tests because it was too much slow for the C version.
- If I replace choice(chars) with choice(letters ~ digits) the code is becomes 4-5 times slower, because DMD 1.026 doesn't join those two constant strings (dynamic arrays) at compile/initialization time.
- I know that that C code runs everywhere, and it doesn't need external libs, while my code needs a lib, and I know you can write good libs in C too, that can help you shorten & simplify you code a lot. And I know you can write hashes in C too, or you can use the GNU data structures already written.
- The running time for the hash functions is lower for the C version, because GCC optimizes that code better.

Bye,
bearophile