std.benchmark is in reviewable state

Robert Jacques sandford at jhu.edu
Wed Sep 28 08:13:07 PDT 2011


On Wed, 28 Sep 2011 08:19:16 -0400, Christophe <travert at phare.normalesup.org> wrote:
> "Robert Jacques" , dans le message (digitalmars.D:145443), a écrit :
>> *sigh* By multiple runs, I mean multiple runs of the benchmark function, _including_ the loops. From my post last Tuesday:
>>
>>      double min_f1 = int.max;
>>      double min_f2 = int.max;
>>      foreach(i;0..10_000) {
>>          auto b=benchmark!(f1,f2)(3);
>>          min_f1 = min(min_f1, b[0].to!("seconds",double));
>>          min_f2 = min(min_f2, b[1].to!("seconds",double));
>>      }
>>
>> All the looping that benchmark currently does is to lengthen the total
>> time a function takes. This makes it possible to avoid sampling errors
>> with regard to the performance counter itself. But you are still only
>> making a single, noisy measurement of a (meta)function. In order to
>> get a robust result, you need to make multiple measurements and take
>> the minimum.
>
> Sorry to question you, but what makes you say that the minimum is more
> interesting than, let's say, the mean + the standard deviation? Is there
> any paper supporting this?
>
> I am not a benchmark specialist, but an average statistician who knows
> that the minimum is not always very informative (a very low value can be
> a piece of luck). I am sure you may have good reasons to choose the
> minimum, but I just want to make sure we make the right choice by
> choosing to use the minimum of consecutive benchmarks.
>

The reason that the mean + std is so pervasive in science is that it's perfect for a Gaussian noise model. In most physical measurements, you expect some readings to fall both above and below the actual value, and for most processes we assume the error of these measurements is normally distributed. Importantly, when we know a process does not produce normally distributed error, we look to other statistics for meaning.

Now, for computer timing measurements, you have access to the physical number of clock cycles taken. There is some noise in this, but it's on the order of 1-2 cycles (i.e. ns). Since a benchmark is trying to measure the number of cycles something takes to complete, and it is physically impossible to complete that task in fewer than some fixed number of cycles, 'noise' in the measurement above 1-2 ns is by definition purely additive, caused by OS or hardware effects (a context switch occurred, a page fault happened, you put your computer to sleep, etc.). So the best way to get at the true number of cycles a task takes to complete is to take the minimum recorded value, on the premise that on at least one run you did get lucky, and those other effects, which you are not trying to measure, had little to no impact.
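For concreteness, here is a self-contained version of the snippet quoted above, with the imports filled in. This is just a sketch assuming the std.datetime benchmark/TickDuration API; f1 and f2 are placeholders for whatever you actually want to measure:

    import std.algorithm : min;
    import std.datetime;        // benchmark() and TickDuration
    import std.stdio;

    void f1() { /* function under test */ }
    void f2() { /* alternative under test */ }

    void main()
    {
        double min_f1 = double.max;
        double min_f2 = double.max;
        // Run the whole benchmark many times. Since OS/hardware noise
        // only ever adds time, the minimum over all runs is the best
        // estimate of the true cost.
        foreach (i; 0 .. 10_000)
        {
            auto b = benchmark!(f1, f2)(3);   // 3 inner iterations per run
            min_f1 = min(min_f1, b[0].to!("seconds", double));
            min_f2 = min(min_f2, b[1].to!("seconds", double));
        }
        writefln("f1: %s s, f2: %s s", min_f1, min_f2);
    }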

So that's the rationale. As for academic support, I've published peer-reviewed papers using this methodology, but I can't remember where I originally got the advice. That said, there are certain types of benchmarks for which the mean + std is the better statistic (networking comes to mind).

P.S. A little Google-fu turned up The Science of Computer Benchmarking by Roger W. Hockney, which does recommend using the minimum, particularly to get around the problem of time-slicing (i.e. context switching) by the OS.
