SIMD on Windows
jerro
a at a.com
Sat Jun 29 09:34:23 PDT 2013
On Saturday, 29 June 2013 at 14:39:44 UTC, Jonathan Dunlap wrote:
> Alright, I'm now officially building for Windows x64 (amd64).
> I've created this early benchmark
> http://dpaste.dzfl.pl/eae0233e to explore SIMD performance. As
> you can see below, on my machine there is almost zero
> difference. Am I missing something?
>
> //===SIMD===
> 0 1.#INF 5 1.#INF <-- vector result
> hnsecs: 100006 <-- duration time
> 0 1.#INF 5 1.#INF
> hnsecs: 90005
> 0 1.#INF 5 1.#INF
> hnsecs: 90006
> //===SCALAR===
> 0 1.#INF 5 1.#INF
> hnsecs: 90005
> 0 1.#INF 5 1.#INF
> hnsecs: 100005
> 0 1.#INF 5 1.#INF
> hnsecs: 100006
First of all, calcSIMD and calcScalar are virtual functions, so
they can't be inlined, which prevents any further optimization.
It also seems that because g, s, i and d are class fields, and
g is a static array, DMD loads them from memory and stores them
back on every iteration, even once calcSIMD and calcScalar are
inlined.
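A minimal sketch of the kind of restructuring that helps here (the
class, field and method names are hypothetical, since the pasted
benchmark isn't reproduced in this post): make the class final so
the methods can be devirtualized and inlined, and copy the fields
into locals so they can stay in registers across the loop.

```d
import core.simd;

final class Bench // final: calls can be devirtualized and inlined
{
    float4 g;
    float4 s;

    // a method on a final class: the compiler is free to inline it
    float4 calcSIMD(float4 v)
    {
        return v * s + g;
    }

    void run(size_t iterations)
    {
        // copy the fields into locals so the optimizer can keep them
        // in registers instead of reloading them every iteration
        float4 localS = s, localG = g;
        float4 acc = 0;
        foreach (n; 0 .. iterations)
            acc = acc * localS + localG;
        g = acc; // store the result back once, after the loop
    }
}
```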
But even if I make the class final and build it with gdc -O3
-finline-functions -frelease -march=native (in which case GDC
generates assembly that looks optimal to me), the scalar version
is still a bit faster than the vector version. The main reason
is that even with scalar code, the processor can execute
multiple independent operations in parallel. On Sandy Bridge
CPUs, for example, a floating point multiplication takes 5
cycles to complete, but the processor can start a new
multiplication every cycle. So my guess is that the first four
multiplications and the second four multiplications in
calcScalar are done in parallel.
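The independent-chains point can be sketched like this (the
variable names are hypothetical; the exact expressions in the
benchmark may differ):

```d
// Each a*, b* update depends only on its own previous value, so on a
// superscalar CPU the eight multiplications can overlap: while one
// multiply is in flight (5-cycle latency on Sandy Bridge), another
// can be issued each cycle (1-per-cycle throughput).
void scalarStep(ref float a0, ref float a1, ref float a2, ref float a3,
                ref float b0, ref float b1, ref float b2, ref float b3,
                float f)
{
    a0 *= f; a1 *= f; a2 *= f; a3 *= f; // first group of four
    b0 *= f; b1 *= f; b2 *= f; b3 *= f; // second group, independent of the first
}
```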
That would explain the scalar code being equally fast, but not
faster than the vector code. The reason it's actually faster is
that GDC replaces the multiplication by 2 with an addition and
omits the multiplication by 1 entirely.
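A sketch of the strength reduction being described (assuming, per
the above, that the scalar path multiplies by the constants 2
and 1):

```d
// With optimization, GDC can compile multiplication by these
// constants without a multiply instruction at all:
double timesTwo(double x) { return x * 2; } // becomes x + x (an addition)
double timesOne(double x) { return x * 1; } // the multiply is omitted entirely
```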
More information about the Digitalmars-d
mailing list