std.benchmark is in reviewable state

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Mon Sep 26 06:39:28 PDT 2011


On 9/25/11 11:56 PM, Robert Jacques wrote:
> On Sun, 25 Sep 2011 23:06:01 -0400, Andrei Alexandrescu
> <SeeWebsiteForEmail at erdani.org> wrote:
>>> So, std.benchmark should
>>> [ ] Set the affinity to a single thread at start (i.e. use
>>> SetProcessAffinityMask, etc)
>>
>> I think that won't influence most measurements measurably.
>
> First, a core jump is guaranteed to, essentially, poison the cache.
> Which, at least in my eyes, invalidates that particular timing run.

Well, it will be one of many runs. The questions are whether (a) core 
jumps are likely to happen during a benchmark run, and (b) enough of 
them happen to influence a 500ms time window.
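On Unix the analogue of SetProcessAffinityMask is sched_setaffinity; a 
minimal, hypothetical Python sketch (Linux-only, not part of 
std.benchmark) of pinning the benchmark process to one core:

```python
import os

# Linux-only: os.sched_setaffinity plays the role SetProcessAffinityMask
# plays on Windows. Pinning the process to a single core rules out core
# jumps (and the cache poisoning and TSC discontinuities they can cause)
# for the duration of the run.
os.sched_setaffinity(0, {0})    # pid 0 = this process; run on CPU 0 only
print(os.sched_getaffinity(0))  # the affinity mask now holds just CPU 0
```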

> Second, timing generally relies on the CPU's Time Stamp Counter, which is
> not multi-thread safe; a core switch invalidates all previous TSC
> values, and hence, the time measurement itself. Furthermore, the TSC is
> not even guaranteed to have a fixed frequency on some CPUs. Now there
> are ways around the problems of the TSC, but even so:
>
> (From the Wikipedia)
> "Under Windows platforms, Microsoft strongly discourages using the TSC
> for high-resolution timing for exactly these reasons, providing instead
> the Windows APIs QueryPerformanceCounter and
> QueryPerformanceFrequency.[2] Even when using these functions, Microsoft
> recommends the code to be locked to a single CPU."

std.benchmark uses QueryPerformanceCounter on Windows and 
clock_gettime/gettimeofday on Unix.
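For reference, Python's time.perf_counter is backed by the same clock 
sources (QueryPerformanceCounter on Windows, clock_gettime with 
CLOCK_MONOTONIC on most Unix systems), which makes for a quick sanity 
check of what such a clock guarantees; a hypothetical illustration, not 
std.benchmark itself:

```python
import time

# In CPython, perf_counter() is backed by QueryPerformanceCounter on
# Windows and clock_gettime(CLOCK_MONOTONIC) on most Unix systems --
# the same timer sources named above.
info = time.get_clock_info("perf_counter")
print(info.monotonic)    # True: the clock never goes backwards
print(info.resolution)   # platform-dependent
```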

>>> [ ] Repeat the measurement N times, taking the min.
>>
>> std.benchmark repeats the measurement in a loop, discounts the times
>> spent in the iteration proper, and divides total time by the number of
>> iterations to figure the time per iteration. This has the advantage that it
>> works with even very fast functions without letting the counter itself
>> affect the timing.
>
> Which is necessary but not sufficient for proper benchmarking. The
> timers themselves are noisy, to say nothing of the effects of context
> switches, warm-up, etc on a run. During the ptr+length vs ptr+ptr
> discussion on Tuesday, the naive use of benchmark led to some very wrong
> conclusions simply because any single run had up to ~5% error. The
> only way to get rid of this error is to make multiple runs.

You may want to take a look at the code. This is a long argument ending 
with "...make multiple runs", as if std.benchmark were not making them 
and you were suggesting it should. Again, std.benchmark runs the tested 
function in a loop in powers of 10 until the entire run lasts at least 
500ms.
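The scheme just described (grow the iteration count in powers of 10 
until one timed run lasts at least 500ms, then divide by the count) can 
be sketched as follows; this is a hypothetical Python illustration, not 
the actual D implementation:

```python
import time

MIN_RUN = 0.5  # seconds; the 500 ms floor described above

def benchmark(fn):
    """Grow the iteration count in powers of 10 until one timed run
    lasts at least MIN_RUN, then report seconds per iteration.  A long
    window amortizes timer overhead and scheduling noise."""
    n = 10
    while True:
        start = time.perf_counter()
        for _ in range(n):
            fn()
        elapsed = time.perf_counter() - start
        if elapsed >= MIN_RUN:
            return elapsed / n
        n *= 10
```

Each call lasts at least half a second by construction, so noise from a 
stray context switch is diluted across a large number of iterations.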

>>> [X] Adaptively increase the number of loop iterations to ensure a valid
>>> reading.
>>
>> The current system tries 10, 100, 1000 iterations until it gets to a
>> total run time of at least 0.5 seconds. That's quite a generous margin;
>> I plan to tighten it.
>>
>>> [ ] Adaptively decrease the number of loop iterations to ensure minimal
>>> context switching.
>>
>> OK.

In fact, upon further thought, the iteration count should not be too 
low. Context switches and other activity are inevitable, but their 
effects are negligible over long periods (as half a second would be).
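If one also wanted the take-the-min behavior Robert asks for, the whole 
measurement can be repeated and the minimum kept, since noise only ever 
adds time; a hypothetical sketch, not what std.benchmark currently does:

```python
import time

def best_of(fn, runs=5, iters=1000):
    """Repeat the whole measurement `runs` times and keep the minimum
    per-iteration time.  Context switches and timer jitter can only
    inflate a reading, so the minimum is the least-noisy estimate."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        for _ in range(iters):
            fn()
        best = min(best, (time.perf_counter() - start) / iters)
    return best
```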

To conclude, I'm unsure what steps you suggest to take to improve 
std.benchmark, and I'd be grateful if you could clarify them after 
perusing its code and documentation.


Thanks,

Andrei


More information about the Digitalmars-d mailing list