Request for timings

bearophile bearophileHUGS at lycos.com
Sun Nov 27 17:49:34 PST 2011


> Here those square roots are parallelizable: the compiler is allowed to use an SSE2 sqrtpd instruction to perform those 10 sqrt(double) with 5 instructions. With the ymm registers of AVX, the instruction VSQRTPD (intrinsic _mm256_sqrt_pd in lesser languages) does 4 double square roots at a time. But maybe its starting location needs to be aligned to 16 bytes (not currently supported syntax):

This is the 32-bit assembly produced by the Intel Fortran compiler for that code; it's heavily optimized and fully inlined:
http://codepad.org/h1ilZWVu

It uses only serial square roots (sqrtsd), so the performance improvement has other causes that I don't know. This also probably means the Fortran version is not the fastest version possible.
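
To make the sqrtsd versus sqrtpd difference concrete, here is a minimal C sketch using the SSE2 intrinsics from emmintrin.h. It is not the code from the benchmark: the function names sqrt_serial and sqrt_packed are made up for illustration, and it assumes an x86 target with SSE2. It uses unaligned loads and stores, so the 16-byte alignment question does not come up here.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical helper: one scalar square root (sqrtsd) per element. */
void sqrt_serial(const double *in, double *out, size_t n)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        __m128d x = _mm_load_sd(&in[i]);           /* load one double */
        _mm_store_sd(&out[i], _mm_sqrt_sd(x, x));  /* sqrt of the low lane */
    }
}

/* Hypothetical helper: one packed square root (sqrtpd) per pair of doubles. */
void sqrt_packed(const double *in, double *out, size_t n)
{
    assert(in != NULL && out != NULL);
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d x = _mm_loadu_pd(&in[i]);          /* unaligned load of 2 doubles */
        _mm_storeu_pd(&out[i], _mm_sqrt_pd(x));    /* 2 square roots at once */
    }
    if (i < n) {                                   /* odd leftover element */
        __m128d x = _mm_load_sd(&in[i]);
        _mm_store_sd(&out[i], _mm_sqrt_sd(x, x));
    }
}
```

With n = 10 the packed loop runs 5 times, which is the "10 sqrt(double) with 5 instructions" case. Whether this is actually faster depends on the CPU, so it needs timing rather than assuming.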

Bye,
bearophile

