Review of Andrei's std.benchmark

David Piepgrass qwertie256 at gmail.com
Fri Sep 21 11:49:33 PDT 2012


> After extensive tests with a variety of aggregate functions, I 
> can say firmly that taking the minimum time is by far the best 
> when it comes to assessing the speed of a function.

Like others, I must also disagree in principle. The minimum sounds
like a useful metric for functions that (1) do the same amount of 
work in every test and (2) are microbenchmarks, i.e. they measure 
a small and simple task. If the benchmark being measured either 
(1) varies the amount of work each time (e.g. according to some 
approximation of real-world input, which obviously may vary)* or 
(2) measures a large system, then the average and standard 
deviation and even a histogram may be useful (or perhaps some 
indicator whether the runtimes are consistent with a normal 
distribution or not). If the running-time is long then the max 
might be useful (because things like task-switching overhead 
probably do not contribute that much to the total).
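
Concretely, these are the aggregates I'd want reported for N 
trials of one fixed workload. This is only a sketch in plain D 
(summarize() and the sample numbers are mine, not anything from 
std.benchmark):

    import std.algorithm, std.math, std.stdio;

    struct Summary { double min, max, mean, stdDev; }

    // Summarize a set of per-trial times (here in microseconds).
    Summary summarize(in double[] usecs)
    {
        Summary s;
        s.min  = reduce!min(usecs);
        s.max  = reduce!max(usecs);
        s.mean = reduce!"a + b"(0.0, usecs) / usecs.length;
        double var = 0;
        foreach (t; usecs)
            var += (t - s.mean) ^^ 2;
        s.stdDev = sqrt(var / usecs.length);
        return s;
    }

    void main()
    {
        // Made-up times for one benchmark run five times.
        auto trials = [102.0, 98.5, 97.9, 140.3, 99.1];
        auto s = summarize(trials);
        writefln("min=%.1f  mean=%.1f +/- %.1f  max=%.1f",
                 s.min, s.mean, s.stdDev, s.max);
    }

With these made-up numbers the minimum would report 97.9 and hide 
the 140.3 outlier entirely, which is exactly the information I'd 
want to see for a long-running or variable workload.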

* I anticipate that you might respond "so, only test a single 
input per benchmark", but if I've got 1000 inputs that I want to 
try, I really don't want to write 1000 functions nor do I want 
1000 lines of output from the benchmark. An average, standard 
deviation, min and max may be all I need, and if I need more 
detail, then I might break it up into 10 groups of 100 inputs. In 
any case, the minimum runtime is not the desired output when the 
input varies.
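
To make that concrete, here is the shape of benchmark I have in 
mind, written against std.datetime's StopWatch (the 2012 API, 
where peek() returns a TickDuration); sort() merely stands in for 
the code under test, and the caller is assumed to supply the 1000 
realistic inputs:

    import std.algorithm, std.datetime, std.stdio;

    // One benchmark, many varied inputs, a handful of output lines.
    // Assumes inputs.length is a multiple of 10 (e.g. 1000).
    void benchmarkVariedInputs(int[][] inputs)
    {
        double[] times;                   // microseconds per input
        foreach (inp; inputs)
        {
            StopWatch sw;
            sw.start();
            sort(inp);                    // stand-in for the real work
            sw.stop();
            times ~= sw.peek().usecs;
        }
        // Report, say, 10 groups of 100 inputs instead of 1000 lines.
        immutable groupSize = times.length / 10;
        foreach (g; 0 .. 10)
        {
            auto slice = times[g * groupSize .. (g + 1) * groupSize];
            writefln("group %s: min=%s  mean=%s  max=%s", g,
                     reduce!min(slice),
                     reduce!"a + b"(0.0, slice) / slice.length,
                     reduce!max(slice));
        }
    }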

It's a little surprising to hear "The purpose of std.benchmark is 
not to estimate real-world time. (That is the purpose of 
profiling)"... Firstly, of COURSE I would want to estimate 
real-world time with some of my benchmarks. For some benchmarks I 
just want to know which of two or three approaches is faster, or 
to get a coarse ball-park sense of performance, but for others I 
really want to know the wall-clock time used for realistic inputs.

Secondly, which D profiler actually helps you answer the question 
"where does the time go in the real world?" The D -profile
switch creates an instrumented executable, which in my experience 
(admittedly not experience with DMD) severely distorts running 
times. I usually prefer sampling-based profiling, where the 
executable is left unchanged and a sampling program interrupts 
the program at random and grabs the call stack, to avoid the 
distortion effect of instrumentation. Of course, instrumentation 
is useful to find out what functions are called the most and 
whether call frequencies are in line with expectations, but I 
wouldn't trust the time measurements that much.

As far as I know, D doesn't offer a sampling profiler, so one 
might indeed use a benchmarking library as a (poor) substitute. 
So I'd want to be able to set up some benchmarks that operate on 
realistic data, perhaps with different data in different runs, in 
order to learn how the speed varies with the input (if it varies a 
lot, then I might create more benchmarks to investigate which 
inputs are processed quickly and which slowly).

Some random comments about std.benchmark based on its 
documentation:

- It is very strange that the documentation of printBenchmarks 
uses neither the word "average" nor "minimum", and doesn't say 
how many trials are done.... I suppose the obvious interpretation 
is that it only does one trial, but then we wouldn't be having 
this discussion about averages and minimums, right? Øivind says 
tests are run 1000 times... but it needs to be configurable 
per-test (my idea: support a _x1000 suffix in function names, or 
_for1000ms to run the test for at least 1000 milliseconds; and 
allow a multiplier when running a group of benchmarks, e.g. 
a multiplier argument of 0.5 means to only run half as many 
trials as usual). Also, it is not clear from the documentation 
what the single parameter passed to each benchmark function is 
(the docs should define "iterations count").

- The "benchmark_relative_" feature looks quite useful. I'm also 
happy to see benchmarkSuspend() and benchmarkResume(), though 
benchmarkSuspend() seems redundant in most cases: I'd like to 
just call one function, say, benchmarkStart() to indicate "setup 
complete, please start measuring time now."
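
To illustrate, this is the calling convention I'm picturing; 
benchmarkStart() is purely my hypothetical name (not part of 
std.benchmark), stubbed out here with a plain StopWatch just so 
the sketch is self-contained:

    import std.datetime;

    StopWatch frameworkTimer;

    // Hypothetical: "setup is complete, start measuring from here."
    void benchmarkStart()
    {
        frameworkTimer = StopWatch(AutoStart.yes);  // discard setup time
    }

    // How a benchmark with expensive setup would then read:
    void benchmark_lookup(uint n)
    {
        auto table = new int[1024];   // setup, excluded from the timing
        benchmarkStart();
        int sum = 0;
        foreach (i; 0 .. n)
            sum += table[i % $];      // the part actually being timed
    }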

- I'm glad that StopWatch can auto-start; but the documentation 
should be clearer: does reset() stop the timer or just zero the 
elapsed time? Does stop() followed by start() resume from zero, or 
does it keep the time already on the clock? I also think there 
should be a method that returns the value of peek() and restarts 
the timer in one call (perhaps stop() and reset() should just 
return peek()?)
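
Something like this is what I mean, written against 
std.datetime's StopWatch (2012 API, where peek() returns a 
TickDuration); the lap() helper is my own suggestion, and it 
stops/resets/starts explicitly precisely because the reset() 
semantics above are unclear to me:

    import std.datetime;

    // Return the elapsed time and restart the clock in one call.
    TickDuration lap(ref StopWatch sw)
    {
        auto elapsed = sw.peek();
        sw.stop();    // be explicit, since it's unclear whether reset()
        sw.reset();   // stops the watch or merely zeroes it
        sw.start();
        return elapsed;
    }

    void example()
    {
        auto sw = StopWatch(AutoStart.yes);
        // ... phase 1 of the work ...
        auto phase1 = lap(sw);
        // ... phase 2 of the work ...
        auto phase2 = lap(sw);
    }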

- After reading the documentation of comparingBenchmark and 
measureTime, I have almost no idea what they do.


