Benchmark Design (was: Why Strings as Classes?)

Thu Aug 28 06:50:46 PDT 2008

This is another small associative array benchmark, with strings. Please spot any problem/bug in it.

You can generate the data set with this Python script:

from array import array
from random import randrange, seed
def generate(filename, N):
    fout = file(filename, "w")
    seed(1)
    a = array("B", " ") * 10
    for i in xrange(N):
        n = randrange(3, 11)
        for j in xrange(n):
            a[j] = randrange(97, 123)
        print >>fout, a.tostring()[:n]
import psyco; psyco.full()
generate("words.txt", 1600000)

It generates a text file of about 13.3 MB, each line contains a random word. Such dataset isn't exactly like a real dataset, because generally words aren't random, they contain lot of redundancy that may worsen a little the performance of an hash function. So this is probably a better situation for an associative array.

The D code:

import std.stream, std.stdio, std.gc;
void main() {
    //disable();
    int[string] d;
    foreach (string line; new BufferedFile("words.txt"))
        d[line.dup] = 0;
}

Note that this program tests the I/O performance too. If you want to avoid that you can read all the file lines up-front and time just the AA creation. (SEE BELOW).

This is a first Python+Psyco version, it's not equal, because the good Python developers have chosen the newlines at the end of the words:

def main():
    d = {}
    for line in file("words.txt"):
        d[line] = 0
import psyco; psyco.full()
main()

This second slower Python version strips the lines from their newline, rstrip() is similar to std.string.stripr():

def main():
    d = {}
    for line in file("words.txt"):
        d[line.rstrip()] = 0

import psyco; psyco.full()
main()

Few timings:

N = 800_000:
	Psyco:          0.69 s
	Psyco stripped: 0.77 s
	D:              1.26 s
	D no GC:        0.96 s

N = 1_600_000:
	Psyco:          1.19 s
	Psyco stripped: 1.35 s
	D:              2.80 s
	D no GC:        2.08 s

Note that disabling the GC in those two Python programs has no effects on their running time.

-------------------------------------

To be sure to not compare apples with oranges I have written two "empty" benchmarks, that measure the time needed to just read the lines:

D code:

import std.stream;
void main() {
    foreach (string line; new BufferedFile("words.txt"))
        line.dup;
}

Python code:

def main():
    for line in file("words.txt"):
        pass
import psyco; psyco.full()
main()

The D version contains a dup to make the comparison more accurate, because the "line" variable contains an actual copy and not just a slice.

Line reading timings, N = 1_600_000:
  D:     0.58 s
  Psyco: 0.30 s

So the I/O of Phobos is slower and may enjoy some tuning, so my timings of the hash benchmarks are off.

Removing 0.58 - 0.30 = 0.28 seconds from the timings of the D associative array benchmarks you have:

N = 1_600_000:
	Psyco:          1.19 s
	Psyco stripped: 1.35 s
	D:              2.80 s
	D no GC:        2.08 s
	D, I/O c:       2.80 - 0.28 = 2.52 s
	D no GC, I/O c: 2.08 - 0.28 = 1.80 s

------------------------------------

To hopefully create a more meaningful benchmark I have then written code that loads all the lines before creating the hash.

The Python+Psyco code:

from timeit import default_timer as clock
def main():
    words = []
    for line in open("words.txt"):
        words.append(line.rstrip())
    t = clock()
    d = {}
    for line in words:
        d[line] = 0
    print round(clock() - t, 2), "s"
import psyco; psyco.bind(main)
main()

The D code:

import std.stream, std.stdio, std.c.time, std.gc;
void main() {
    string[] words;
    foreach (string line; new BufferedFile("words.txt"))
        words ~= line.dup;
    //disable();
    auto t = clock();
    int[string] d;
    foreach (word; words)
        d[word] = 0;
    writefln((cast(double)(clock()-t))/CLOCKS_PER_SEC, " s");
}

Timings, N = 1_600_000:
	Psyco:          0.61 s  (total running time: 1.36 s)
	D:              1.42 s  (total running time: 2.46 s)
	D no GC:        1.22 s  (total running time: 2.28 s)

Bye,
bearophile