SIMD on Windows

jerro a at a.com
Sat Jun 29 09:34:23 PDT 2013


On Saturday, 29 June 2013 at 14:39:44 UTC, Jonathan Dunlap wrote:
> Alright, I'm now officially building for Windows x64 (amd64). 
> I've created this early benchmark 
> http://dpaste.dzfl.pl/eae0233e to explore SIMD performance. As 
> you can see below, on my machine there is almost zero 
> difference. Am I missing something?
>
> //===SIMD===
> 0 1.#INF 5 1.#INF <-- vector result
> hnsecs: 100006 <-- duration time
> 0 1.#INF 5 1.#INF
> hnsecs: 90005
> 0 1.#INF 5 1.#INF
> hnsecs: 90006
> //===SCALAR===
> 0 1.#INF 5 1.#INF
> hnsecs: 90005
> 0 1.#INF 5 1.#INF
> hnsecs: 100005
> 0 1.#INF 5 1.#INF
> hnsecs: 100006

First of all, calcSIMD and calcScalar are virtual functions, so 
they can't be inlined, which prevents any further optimization. 
It also seems that because g, s, i and d are class fields and g 
is a static array, DMD loads them from memory and stores them 
back on every iteration even when calcSIMD and calcScalar are 
inlined.

But even if I make the class final and build it with gdc -O3 
-finline-functions -frelease -march=native (in which case GDC 
generates assembly that looks optimal to me), the scalar version 
is still a bit faster than the vector version. The main reason 
is that even with scalar code, the processor can execute multiple 
independent operations in parallel. On Sandy Bridge CPUs, for 
example, a floating-point multiplication takes 5 cycles to 
complete, but the processor can start one multiplication per 
cycle. So my guess is that the first four multiplications and 
the second four multiplications in calcScalar are done in 
parallel.

That would explain the scalar code being equally fast, but not 
faster than the vector code. The reason it's actually faster is 
that gdc replaces multiplication by 2 with addition and omits 
multiplication by 1.


More information about the Digitalmars-d mailing list