std.benchmark is in reviewable state

Robert Jacques sandford at jhu.edu
Sun Sep 25 21:56:58 PDT 2011


On Sun, 25 Sep 2011 23:06:01 -0400, Andrei Alexandrescu <SeeWebsiteForEmail at erdani.org> wrote:
> On 9/25/11 9:02 PM, Robert Jacques wrote:
>> On Sun, 25 Sep 2011 21:08:45 -0400, Andrei Alexandrescu
>> <SeeWebsiteForEmail at erdani.org> wrote:
>>> I've had a good time with std.benchmark today and made it ready for
>>> submission to Phobos. As a perk I've added the %h format specifier,
>>> which formats a number in human-readable form using SI prefixes.
>>>
>>> For those interested in discussing this prior to the formal review
>>> process, the code is at:
>>>
>>> https://github.com/andralex/phobos/blob/benchmark/std/benchmark.d
>>> https://github.com/andralex/phobos/blob/benchmark/std/format.d
>>>
>>> and the dox at
>>>
>>> http://erdani.com/d/web/phobos-prerelease/std_benchmark.html
>>> http://erdani.com/d/web/phobos-prerelease/std_format.html
>>>
>>> Comments and suggestions are welcome. If possible, it would be great if
>>> this submission were given a bump in the priority queue. This is because
>>> with robust benchmarking in Phobos we can ask new, performance-sensitive
>>> submissions to add benchmarking.
>>>
>>>
>>> Thanks,
>>>
>>> Andrei
>>
>> Andrei, good job on the framework design, but you've left in all of the
>> design flaws of the current benchmark routine. To quote myself from last
>> Tuesday:
>>
>>>> On Tue, 20 Sep 2011 14:01:05 -0400, Timon Gehr <timon.gehr at gmx.ch>
>>>> wrote:
>>> [snip]
>>>
>>>> Thank you for making this more meaningful! I assumed the standard
>>>> library benchmark function would take care of those things. Should it?
>>> Yes and no. Benchmark provides a good way to make a single measurement
>>> of a function, as for really short functions you do have to loop many
>>> times to be able to get a reliable reading. However, actual
>>> benchmarking requires a) tuning the benchmark() call time to about
>>> 10-20 ms and b) running benchmark() many times, taking the minimum.
>>> The idea is that on any given run you could hit a context switch,
>>> etc., so if you make multiple runs, one will get lucky and not be
>>> interrupted. Worse, if a core switch happens between StopWatch start
>>> and end, the number of ticks returned is random. Hence, the comment to
>>> manually limit execution to a single core. So, it might be nice if
>>> benchmark() also took a second parameter, denoting the number of times
>>> to benchmark the function, and had some better documentation on how to
>>> do good benchmarking.
>>
>> Wikipedia also lists some of the challenges to modern benchmarking:
>> http://en.wikipedia.org/wiki/Time_Stamp_Counter
>>
>> So, std.benchmark should
>> [ ] Set the process affinity to a single core at start (i.e. use
>> SetProcessAffinityMask, etc.)
>
> I think that won't influence most measurements measurably.

First, a core jump is essentially guaranteed to poison the cache, which, at least in my eyes, invalidates that particular timing run.

Second, timing generally relies on the CPU's Time Stamp Counter, which is not reliable across cores; a core switch invalidates all previous TSC values and, hence, the time measurement itself. Furthermore, on some CPUs the TSC is not even guaranteed to run at a fixed frequency. There are ways around the TSC's problems, but even so:

(From Wikipedia:)
"Under Windows platforms, Microsoft strongly discourages using the TSC for high-resolution timing for exactly these reasons, providing instead the Windows APIs QueryPerformanceCounter and QueryPerformanceFrequency.[2] Even when using these functions, Microsoft recommends the code to be locked to a single CPU."

>> [ ] Repeat the measurement N times, taking the min.
>
> std.benchmark repeats the measurement in a loop, discounts the time
> spent in the iteration proper, and divides total time by the number of
> iterations to figure the time per iteration. This has the advantage that
> it works even with very fast functions without letting the counter itself
> affect the timing.

Which is necessary but not sufficient for proper benchmarking. The timers themselves are noisy, to say nothing of the effects of context switches, warm-up, etc. on a run. During the ptr+length vs. ptr+ptr discussion on Tuesday, naive use of benchmark led to some very wrong conclusions simply because any single run had up to ~5% error. The only way to get rid of this error is to make multiple runs.
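
As a sketch of what I'm asking for (this is not the std.benchmark API; the bestOf name and the use of std.datetime's StopWatch are just for illustration):

import core.time : TickDuration;
import std.datetime : StopWatch;

// Time fun() over `iters` iterations, repeat that whole measurement `runs`
// times, and keep the fastest run -- the one least likely to have been
// disturbed by a context switch, an interrupt, etc.
TickDuration bestOf(alias fun)(uint iters, uint runs)
{
    auto best = TickDuration(long.max);
    foreach (r; 0 .. runs)
    {
        StopWatch sw;
        sw.start();
        foreach (i; 0 .. iters)
            fun();
        sw.stop();
        if (sw.peek() < best)
            best = sw.peek();
    }
    return best; // per-call time is best / iters
}

Tune `iters` so that each run lands in the 10-20 ms range, then take the minimum over a handful of runs; that combination is what keeps a single unlucky context switch from skewing the result.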

>> [X] Adaptively increase the number of loop iterations to ensure a valid
>> reading.
>
> The current system tries 10, 100, 1000 iterations until it gets to a
> total run time of at least 0.5 seconds. That's quite a generous margin;
> I plan to tighten it.
>
>> [ ] Adaptively decrease the number of loop iterations to ensure minimal
>> context switching.
>
> OK.
>
>
> Andrei

