Benchmark Design (was: Why Strings as Classes?)
bearophile
bearophileHUGS at lycos.com
Thu Aug 28 06:50:46 PDT 2008
This is another small associative array benchmark, with strings. Please spot any problem/bug in it.
You can generate the data set with this Python script:
from array import array
from random import randrange, seed
def generate(filename, N):
fout = file(filename, "w")
seed(1)
a = array("B", " ") * 10
for i in xrange(N):
n = randrange(3, 11)
for j in xrange(n):
a[j] = randrange(97, 123)
print >>fout, a.tostring()[:n]
import psyco; psyco.full()
generate("words.txt", 1600000)
It generates a text file of about 13.3 MB, each line contains a random word. Such dataset isn't exactly like a real dataset, because generally words aren't random, they contain lot of redundancy that may worsen a little the performance of an hash function. So this is probably a better situation for an associative array.
The D code:
import std.stream, std.stdio, std.gc;
void main() {
//disable();
int[string] d;
foreach (string line; new BufferedFile("words.txt"))
d[line.dup] = 0;
}
Note that this program tests the I/O performance too. If you want to avoid that you can read all the file lines up-front and time just the AA creation. (SEE BELOW).
This is a first Python+Psyco version, it's not equal, because the good Python developers have chosen the newlines at the end of the words:
def main():
d = {}
for line in file("words.txt"):
d[line] = 0
import psyco; psyco.full()
main()
This second slower Python version strips the lines from their newline, rstrip() is similar to std.string.stripr():
def main():
d = {}
for line in file("words.txt"):
d[line.rstrip()] = 0
import psyco; psyco.full()
main()
Few timings:
N = 800_000:
Psyco: 0.69 s
Psyco stripped: 0.77 s
D: 1.26 s
D no GC: 0.96 s
N = 1_600_000:
Psyco: 1.19 s
Psyco stripped: 1.35 s
D: 2.80 s
D no GC: 2.08 s
Note that disabling the GC in those two Python programs has no effects on their running time.
-------------------------------------
To be sure to not compare apples with oranges I have written two "empty" benchmarks, that measure the time needed to just read the lines:
D code:
import std.stream;
void main() {
foreach (string line; new BufferedFile("words.txt"))
line.dup;
}
Python code:
def main():
for line in file("words.txt"):
pass
import psyco; psyco.full()
main()
The D version contains a dup to make the comparison more accurate, because the "line" variable contains an actual copy and not just a slice.
Line reading timings, N = 1_600_000:
D: 0.58 s
Psyco: 0.30 s
So the I/O of Phobos is slower and may enjoy some tuning, so my timings of the hash benchmarks are off.
Removing 0.58 - 0.30 = 0.28 seconds from the timings of the D associative array benchmarks you have:
N = 1_600_000:
Psyco: 1.19 s
Psyco stripped: 1.35 s
D: 2.80 s
D no GC: 2.08 s
D, I/O c: 2.80 - 0.28 = 2.52 s
D no GC, I/O c: 2.08 - 0.28 = 1.80 s
------------------------------------
To hopefully create a more meaningful benchmark I have then written code that loads all the lines before creating the hash.
The Python+Psyco code:
from timeit import default_timer as clock
def main():
words = []
for line in open("words.txt"):
words.append(line.rstrip())
t = clock()
d = {}
for line in words:
d[line] = 0
print round(clock() - t, 2), "s"
import psyco; psyco.bind(main)
main()
The D code:
import std.stream, std.stdio, std.c.time, std.gc;
void main() {
string[] words;
foreach (string line; new BufferedFile("words.txt"))
words ~= line.dup;
//disable();
auto t = clock();
int[string] d;
foreach (word; words)
d[word] = 0;
writefln((cast(double)(clock()-t))/CLOCKS_PER_SEC, " s");
}
Timings, N = 1_600_000:
Psyco: 0.61 s (total running time: 1.36 s)
D: 1.42 s (total running time: 2.46 s)
D no GC: 1.22 s (total running time: 2.28 s)
Bye,
bearophile
More information about the Digitalmars-d
mailing list