std.benchmark is in reviewable state

Robert Jacques sandford at jhu.edu
Mon Sep 26 08:43:52 PDT 2011


On Mon, 26 Sep 2011 09:39:28 -0400, Andrei Alexandrescu <SeeWebsiteForEmail at erdani.org> wrote:

> On 9/25/11 11:56 PM, Robert Jacques wrote:
>> On Sun, 25 Sep 2011 23:06:01 -0400, Andrei Alexandrescu
>> <SeeWebsiteForEmail at erdani.org> wrote:
>>>> So, std.benchmark should
>>>> [ ] Set the affinity to a single core at start (i.e. use
>>>> SetProcessAffinityMask, etc)
>>>
>>> I think that won't influence most measurements measurably.
>>
>> First, a core jump is guaranteed to, essentially, poison the cache,
>> which, at least in my eyes, invalidates that particular timing run.
>
> Well it will be one of many runs. The questions are whether (a) core
> jumps are likely to happen during a benchmark run, and (b) whether enough
> of them happen to influence a 500ms time window.

I don't know exactly how likely core jumps are, but context switches definitely happen. And yes, in my experience, they influence a 50 ms time window. Remember, a context switch has two outcomes: your thread keeps running or your thread is paused. But the timer doesn't pause when your thread does, so a single bad switch is enough to skew the measurement.
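
To see this directly, one can time an identical workload many times and compare the spread. A toy sketch in D using std.datetime's StopWatch (not part of std.benchmark; on a loaded machine the max typically dwarfs the min even though the work never changes):

    import std.algorithm : max, min;
    import std.datetime : StopWatch;
    import std.stdio : writefln;

    void main()
    {
        double lo = double.max, hi = 0;
        int sink; // printed below so the optimizer keeps the workload
        foreach (run; 0 .. 1_000)
        {
            StopWatch sw;
            sw.start();
            foreach (i; 0 .. 100_000) sink += i; // identical work every run
            sw.stop();
            auto t = sw.peek().to!("seconds", double);
            lo = min(lo, t);
            hi = max(hi, t);
        }
        // A single context switch inside one run shows up as a large max.
        writefln("min = %s s, max = %s s (sink: %s)", lo, hi, sink);
    }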

>> Second, timing generally relies on the CPU's Time Stamp Counter, which is
>> not multi-thread safe; a core switch invalidates all previous TSC
>> values, and hence the time measurement itself. Furthermore, the TSC is
>> not even guaranteed to have a fixed frequency on some CPUs. Now there
>> are ways around the problems of the TSC, but even so:
>>
>> (From Wikipedia)
>> "Under Windows platforms, Microsoft strongly discourages using the TSC
>> for high-resolution timing for exactly these reasons, providing instead
>> the Windows APIs QueryPerformanceCounter and
>> QueryPerformanceFrequency.[2] Even when using these functions, Microsoft
>> recommends the code to be locked to a single CPU."
>
> std.benchmark uses QueryPerformanceCounter on Windows and
> clock_gettime/gettimeofday on Unix.

Great, but MS still recommends benchmarking be done on a single core. And if MS thinks that is how benchmarking should be done, I think that's how we should do it.

From the MSDN documentation on QueryPerformanceCounter:
"On a multiprocessor computer, it should not matter which processor is called. However, you can get different results on different processors due to bugs in the basic input/output system (BIOS) or the hardware abstraction layer (HAL). To specify processor affinity for a thread, use the SetThreadAffinityMask function."

>>>> [ ] Repeat the measurement N times, taking the min.
>>>
>>> std.benchmark repeats the measurement in a loop, discounts the time
>>> spent in the iteration proper, and divides total time by the number of
>>> iterations to figure the time per iteration. This has the advantage that
>>> it works with even very fast functions without letting the counter itself
>>> affect the timing.
>>
>> Which is necessary but not sufficient for proper benchmarking. The
>> timers themselves are noisy, to say nothing of the effects of context
>> switches, warm-up, etc. on a run. During the ptr+length vs ptr+ptr
>> discussion on Tuesday, the naive use of benchmark led to some very wrong
>> conclusions simply because any single run had up to ~5% error. The
>> only way to get rid of this error is to make multiple runs.
>
> You may want to take a look at the code. This is a long argument ending
> with "...make multiple runs," as if std.benchmark were not doing them and
> you were suggesting it should. Again, std.benchmark runs the tested function
> in a loop, increasing the iteration count in powers of 10, until the entire
> run lasts at least 500ms.

*sigh* By multiple runs, I mean multiple runs of the benchmark function, _including_ the loops. From my post last Tuesday:

     import std.algorithm : min;
     import std.datetime : benchmark;

     double min_f1 = int.max;
     double min_f2 = int.max;
     // Repeat the entire benchmark (inner loops included) many times and
     // keep the fastest run; the minimum is the least noise-polluted sample.
     foreach (i; 0 .. 10_000) {
         auto b = benchmark!(f1, f2)(3);
         min_f1 = min(min_f1, b[0].to!("seconds", double));
         min_f2 = min(min_f2, b[1].to!("seconds", double));
     }

All the looping that benchmark currently does serves to lengthen the total time a function takes, which avoids sampling error from the resolution of the performance counter itself. But you are still taking only a single, noisy measurement of a (meta)function. To get a robust result, you need to make multiple measurements and take the minimum.
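
Putting the two together: the inner, adaptive loop addresses timer resolution, and the outer min-of-N loop above addresses noise. A rough sketch of the inner part under the 0.5-second policy Andrei describes (timePerIteration is a hypothetical helper, not std.benchmark's actual API):

    import std.datetime : StopWatch;

    // Grow the iteration count in powers of 10 until one run lasts at
    // least half a second, then report the average time per iteration.
    double timePerIteration(void delegate() f)
    {
        enum minRunTime = 0.5; // seconds
        for (size_t n = 10; ; n *= 10)
        {
            StopWatch sw;
            sw.start();
            foreach (i; 0 .. n) f();
            sw.stop();
            auto total = sw.peek().to!("seconds", double);
            if (total >= minRunTime)
                return total / n;
        }
    }

Even then, one call to such a helper is a single noisy sample; wrapping it in the min-of-N loop above is what makes the result robust.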

>>>> [X] Adaptively increase the number of loop iterations to ensure a valid
>>>> reading.
>>>
>>> The current system tries 10, 100, 1000 iterations until it gets to a
>>> total run time of at least 0.5 seconds. That's quite a generous margin;
>>> I plan to tighten it.
>>>
>>>> [ ] Adaptively decrease the number of loop iterations to ensure minimal
>>>> context switching.
>>>
>>> OK.
>
> In fact, upon further thinking, the iterations should not be too few.
> Context switches and other activity are inevitable, but the effects are
> negligible over long periods (as half a second would be).
>
> To conclude, I'm unsure what steps you suggest to take to improve
> std.benchmark, and I'd be grateful if you could clarify them after
> perusing its code and documentation.
>
>
> Thanks,
>
> Andrei

