Review of Andrei's std.benchmark
Andrei Alexandrescu
SeeWebsiteForEmail at erdani.org
Fri Sep 21 14:12:03 PDT 2012
On 9/21/12 2:49 PM, David Piepgrass wrote:
>> After extensive tests with a variety of aggregate functions, I can say
>> firmly that taking the minimum time is by far the best when it comes
>> to assessing the speed of a function.
>
> Like others, I must also disagree in principle. The minimum sounds like a
> useful metric for functions that (1) do the same amount of work in every
> test and (2) are microbenchmarks, i.e. they measure a small and simple
> task.
That is correct.
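Concretely, the discipline I'm advocating looks roughly like this (a
minimal hand-rolled sketch using std.datetime's StopWatch, not
std.benchmark itself; the function under test and the counts are made-up
placeholders):

import std.algorithm : min;
import std.datetime : StopWatch;
import std.stdio : writefln;

// Placeholder for a fixed-work microbenchmark.
void functionUnderTest()
{
    int sum = 0;
    foreach (i; 0 .. 10_000)
        sum += i;
}

void main()
{
    enum trials = 1_000;        // independent trials
    enum itersPerTrial = 100;   // iterations folded into each trial

    long best = long.max;       // minimum over all trials, in microseconds
    foreach (t; 0 .. trials)
    {
        StopWatch sw;
        sw.start();
        foreach (i; 0 .. itersPerTrial)
            functionUnderTest();
        sw.stop();
        // Keep the smallest observed time: noise (scheduling, cache
        // effects, interrupts) only ever adds to a trial's time.
        best = min(best, sw.peek().usecs);
    }
    writefln("best time per call: %s us",
             cast(double) best / itersPerTrial);
}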
> If the benchmark being measured either (1) varies the amount of
> work each time (e.g. according to some approximation of real-world
> input, which obviously may vary)* or (2) measures a large system, then
> the average and standard deviation and even a histogram may be useful
> (or perhaps some indicator whether the runtimes are consistent with a
> normal distribution or not). If the running-time is long then the max
> might be useful (because things like task-switching overhead probably do
> not contribute that much to the total).
>
> * I anticipate that you might respond "so, only test a single input per
> benchmark", but if I've got 1000 inputs that I want to try, I really
> don't want to write 1000 functions nor do I want 1000 lines of output
> from the benchmark. An average, standard deviation, min and max may be
> all I need, and if I need more detail, then I might break it up into 10
> groups of 100 inputs. In any case, the minimum runtime is not the
> desired output when the input varies.
I understand. What we currently do at Facebook is support benchmark
functions with two parameters (see
https://github.com/facebook/folly/blob/master/folly/docs/Benchmark.md).
One is the number of iterations, the second is "problem size", akin to
what you're discussing.
I chose not to support that in this version of std.benchmark because it
can easily be tackled later, but I probably need to add it now, sigh.
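To illustrate the shape of it (hypothetical names and signature; this is
not the current std.benchmark API, just the folly-style convention
transplanted to D):

// Hypothetical two-parameter benchmark function: the framework would pass
// both the iteration count and the "problem size". Everything here is
// illustrative only.
void benchmark_appendN(uint iterations, size_t problemSize)
{
    foreach (i; 0 .. iterations)
    {
        int[] a;
        a.reserve(problemSize);
        foreach (j; 0 .. problemSize)
            a ~= cast(int) j;
    }
}

void main()
{
    // Driving it by hand here; the framework would choose both numbers
    // and report time per iteration or per element.
    benchmark_appendN(10, 1_000);
}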
> It's a little surprising to hear "The purpose of std.benchmark is not to
> estimate real-world time. (That is the purpose of profiling)"...
> Firstly, of COURSE I would want to estimate real-world time with some of
> my benchmarks. For some benchmarks I just want to know which of two or
> three approaches is faster, or to get a coarse ball-park sense of
> performance, but for others I really want to know the wall-clock time
> used for realistic inputs.
I would contend that a benchmark without a baseline is very often
misguided. I've seen tons and tons and TONS of nonsensical benchmarks
lacking a baseline. "I created one million smart pointers, it took me
only one millisecond!" Well, how long did it take you to create one
million dumb pointers?
Choosing good baselines and committing to good comparisons instead of
baseless absolutes is what makes the difference between a professional
and a well-intentioned dilettante.
> Secondly, what D profiler actually helps you answer the question "where
> does the time go in the real-world?"? The D -profile switch creates an
> instrumented executable, which in my experience (admittedly not
> experience with DMD) severely distorts running times. I usually prefer
> sampling-based profiling, where the executable is left unchanged and a
> sampling program interrupts the program at random and grabs the call
> stack, to avoid the distortion effect of instrumentation. Of course,
> instrumentation is useful to find out what functions are called the most
> and whether call frequencies are in line with expectations, but I
> wouldn't trust the time measurements that much.
>
> As far as I know, D doesn't offer a sampling profiler, so one might
> indeed use a benchmarking library as a (poor) substitute. So I'd want to
> be able to set up some benchmarks that operate on realistic data, with
> perhaps different data in different runs in order to learn about how the
> speed varies with different inputs (if it varies a lot then I might
> create more benchmarks to investigate which inputs are processed
> quickly, and which slowly.)
I understand there's a good case to be made for profiling. If this turns
out to be an acceptance condition for std.benchmark (which I think it
shouldn't be), I'll define one.
> Some random comments about std.benchmark based on its documentation:
>
> - It is very strange that the documentation of printBenchmarks uses
> neither of the words "average" nor "minimum", and doesn't say how many
> trials are done...
Because all of those are irrelevant and confusing. We had an older
framework at Facebook that reported those numbers, and they were utterly
and completely meaningless. Besides, the trials column contained numbers
that were not even comparable. Everybody was happy when I replaced them
with today's simple and elegant numbers.
> I suppose the obvious interpretation is that it only
> does one trial, but then we wouldn't be having this discussion about
> averages and minimums, right? Øivind says tests are run 1000 times... but
> it needs to be configurable per-test (my idea: support a _x1000 suffix
> in function names, or _for1000ms to run the test for at least 1000
> milliseconds; and allow a multiplier when running a group of
> benchmarks, e.g. a multiplier argument of 0.5 means to only run half as
> many trials as usual.)
I don't think that's a good idea.
> Also, it is not clear from the documentation what
> the single parameter to each benchmark is (define "iterations count".)
The documentation could include that, but I don't want to overspecify.
> - The "benchmark_relative_" feature looks quite useful. I'm also happy
> to see benchmarkSuspend() and benchmarkResume(), though
> benchmarkSuspend() seems redundant in most cases: I'd like to just call
> one function, say, benchmarkStart() to indicate "setup complete, please
> start measuring time now."
Good point. Still, I think the extra call is a minor encumbrance, so it's
worth keeping the generality.
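For completeness, here's the pattern in question spelled out with a plain
StopWatch (benchmarkSuspend/benchmarkResume would serve the same purpose
inside a benchmark function; the data and the work below are made up):

import std.datetime : StopWatch;
import std.stdio : writefln;

void main()
{
    StopWatch sw;

    // Expensive setup -- deliberately outside the measured region.
    auto data = new int[](100_000);
    foreach (i, ref x; data)
        x = cast(int) i;

    sw.start();                 // "setup complete, start measuring now"
    long sum = 0;
    foreach (x; data)
        sum += x;
    sw.stop();

    writefln("sum = %s, measured: %s us", sum, sw.peek().usecs);
}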
> - I'm glad that StopWatch can auto-start; but the documentation should
> be clearer: does reset() stop the timer or just reset the time to zero?
> does stop() followed by start() start from zero or does it keep the time
> on the clock? I also think there should be a method that returns the
> value of peek() and restarts the timer at the same time (perhaps stop()
> and reset() should just return peek()?)
>
> - After reading the documentation of comparingBenchmark and measureTime,
> I have almost no idea what they do.
Yah, these are moved over from std.datetime. I'll need to make a couple
more passes through the dox.
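As for the StopWatch questions above, the semantics are easy to pin down
empirically while the dox catch up; a quick experiment along these lines
prints the accumulated time at each step, which shows whether stop()
followed by start() keeps the time on the clock and what reset() does:

import core.thread : Thread;
import core.time : dur;
import std.datetime : AutoStart, StopWatch;
import std.stdio : writeln;

void main()
{
    auto sw = StopWatch(AutoStart.yes);
    Thread.sleep(dur!"msecs"(50));
    sw.stop();
    writeln("after ~50ms and stop(): ", sw.peek().msecs, " ms");

    sw.start();                        // does this resume from ~50 ms?
    Thread.sleep(dur!"msecs"(50));
    sw.stop();
    writeln("after another ~50ms:    ", sw.peek().msecs, " ms");

    sw.reset();                        // does this zero the clock?
    writeln("after reset():          ", sw.peek().msecs, " ms");
}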
Andrei