std.benchmark is in reviewable state

Don nospam at nospam.com
Thu Sep 29 05:47:10 PDT 2011


On 26.09.2011 17:43, Robert Jacques wrote:
>>> Second, timing generally relies on the CPU's Time Stamp Counter, which is
>>> not multi-thread safe; a core switch invalidates all previous TSC
>>> values, and hence, the time measurement itself. Furthermore, the TSC is
>>> not even guaranteed to have a fixed frequency on some CPUs. Now there
>>> are ways around the problems of the TSC, but even so:
>>>
>>> (From the Wikipedia)
>>> "Under Windows platforms, Microsoft strongly discourages using the TSC
>>> for high-resolution timing for exactly these reasons, providing instead
>>> the Windows APIs QueryPerformanceCounter and
>>> QueryPerformanceFrequency.[2] Even when using these functions, Microsoft
>>> recommends the code to be locked to a single CPU."
>>
>> std.benchmark uses QueryPerformanceCounter on Windows and
>> clock_gettime/gettimeofday on Unix.
>
> Great, but MS still recommends benchmarking be done on a single core.
> And if MS thinks that is how benchmarking should be done, I think that's
> how we should do it.

I think that's quite misleading. Microsoft has made some assumptions
there about what you are measuring. QueryPerformanceCounter gives you
the wall clock time. But with modern CPUs (especially Sandy Bridge
Core i7, but going as far back as the Pentium M) the CPU frequency
isn't constant. So the wall clock time depends both on the number of
clock cycles required to execute the code AND on the temperature of
the CPU! If you run the same benchmark code enough times, it'll
eventually get slower as the CPU heats up!
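
To make that concrete, here's roughly what the QueryPerformanceCounter
route looks like in C. This is just a minimal sketch, not anything from
std.benchmark, and workload() is a made-up stand-in for whatever code
you're benchmarking:

/* Minimal wall-clock timing sketch on Windows, using the
 * QueryPerformanceCounter/QueryPerformanceFrequency pair Microsoft
 * recommends. The result is elapsed real time, so it reflects whatever
 * frequency the CPU happened to be running at during the run. */
#include <stdio.h>
#include <windows.h>

/* made-up workload, standing in for the code being benchmarked */
static void workload(void)
{
    volatile double x = 0.0;
    int i;
    for (i = 0; i < 1000000; ++i)
        x += i * 0.5;
}

int main(void)
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);   /* counter ticks per second */

    QueryPerformanceCounter(&t0);
    workload();
    QueryPerformanceCounter(&t1);

    double seconds = (double)(t1.QuadPart - t0.QuadPart)
                   / (double)freq.QuadPart;
    printf("elapsed: %.6f s\n", seconds);
    return 0;
}

The tick difference divided by the frequency is elapsed real time, so
if the clock throttles halfway through, that shows up directly in the
measurement.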

Personally, I always use the TSC, and then discard any results where
the TSC is inconsistent. Specifically, I use the hardware performance
counters, since they give you real information: code #1 executes N
more instructions than code #2, it has B fewer branches, it has D more
level 1 cache misses, etc. I've managed to get rock-solid data that
way, but it takes a lot of work (e.g., you have to make sure that your
stack is aligned to 16 bytes).
BUT this sort of thing is only relevant to really small sections of code 
(< 1 time slice).
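
For illustration, here's a Linux-only sketch of reading one such
counter through perf_event_open. It's just one way to get at the
hardware counters, not necessarily the setup described above, and
workload() is again a made-up stand-in:

/* Count retired instructions around a piece of code on Linux via
 * perf_event_open. Requires a kernel with perf events and a permissive
 * enough perf_event_paranoid setting. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* glibc provides no wrapper, so call the syscall directly */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* made-up workload, standing in for the code being benchmarked */
static void workload(void)
{
    volatile long x = 0;
    long i;
    for (i = 0; i < 1000000; ++i)
        x += i;
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;  /* retired instructions */
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    /* measure this process, on whichever CPU it runs */
    int fd = perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) {
        perror("perf_event_open");
        return 1;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    workload();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long count;
    read(fd, &count, sizeof(count));
    printf("instructions retired: %lld\n", count);
    close(fd);
    return 0;
}

Branches and cache misses are just different config values
(PERF_COUNT_HW_BRANCH_INSTRUCTIONS, PERF_COUNT_HW_CACHE_MISSES), which
is what makes comparisons like "N more instructions, B fewer branches"
possible.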

If you're timing something that involves hardware other than the CPU
(e.g., database access, network access) then you really want the wall
clock time. And if the section of code is long enough, then maybe you
do care how much it heats up the CPU!
But OTOH, once you're in that regime, I don't think you care about
processor affinity. In real life, you WILL get core transitions.

So I don't think it's anywhere near as simple as Microsoft makes it
out to be. It's very important to be clear about what kind of stuff
you intend to measure.

