Optimizing for SIMD: best practices? (i.e. what features are allowed?)

Fri Feb 26 03:38:51 UTC 2021

On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
> Is there any way to guarantee that "packed" versions of SIMD 
> instructions will be used? (e.g. vmulps, vsqrtps, etc.)
> To give some context, this is a sample of one of the functions 
> that could benefit from better SIMD usage :
>>float euclideanDistanceFixedSizeArray(float[3] a, float[3] b) {

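You need to use __vector(float[4]) instead of float[3] to tell 
the compiler to pack multiple elements per SIMD register. Right 
now your data lacks proper alignment for SIMD loads and stores: 
a float[3] is 12 bytes with 4-byte alignment, while a 128-bit 
SIMD register holds 16 bytes and its aligned loads and stores 
expect 16-byte alignment.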
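As a rough illustration of that change, here is a minimal sketch using 
core.simd, where float4 is the alias for __vector(float[4]). The 
function name euclideanDistance and the convention that the fourth 
lane is zero padding are assumptions of this sketch, not something 
from the original function. It only shows the data layout; whether 
packed instructions are actually emitted still depends on the 
compiler and flags.

```d
import core.simd;          // float4 is an alias for __vector(float[4])
import std.math : sqrt;

// Hypothetical SIMD-friendly variant of the function from the question.
// Assumption: the fourth lane is padding and is zero in both inputs.
float euclideanDistance(float4 a, float4 b)
{
    float4 d = a - b;      // element-wise subtract on all four lanes
    float4 sq = d * d;     // element-wise multiply on all four lanes
    // Horizontal sum: this step mixes lanes, so it is done in scalar code.
    return sqrt(sq.array[0] + sq.array[1] + sq.array[2] + sq.array[3]);
}

unittest
{
    float4 a = [1.0f, 2.0f, 3.0f, 0.0f];
    float4 b = [4.0f, 6.0f, 3.0f, 0.0f];
    assert(euclideanDistance(a, b) == 5.0f);
}
```

Note that the final sum and square root are still scalar work on a 
single result, so a lone 3D distance is not where most of the gain 
would come from.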

Beyond that, SIMD code is rather difficult to optimize. Code 
written in ignorance or in a rush is unlikely to be meaningfully 
faster than ordinary scalar code, unless the data flow is very 
simple. You will probably get a bigger speedup for less effort 
and pain by first minimizing heap allocations, maximizing 
locality of reference, minimizing indirections, and minimizing 
memory use. (And, of course, it should go without saying that 
choosing an asymptotically efficient high-level algorithm is more 
important than any micro-optimization for large data sets.) 
Nevertheless, if you are up to the challenge, SIMD can sometimes 
provide a final 2-3x speed boost.

Your algorithms will need to be designed to minimize mixing of 
data between SIMD channels, as this forces the generation of lots 
of extra instructions to swizzle the data, or worse to unpack and 
repack it. Something like a Cartesian dot product or cross 
product will benefit much less from SIMD than vector addition, 
for example. Sometimes the amount of swizzling can be greatly 
reduced with a little algebra, other times you might need to 
refactor an array of structures into a structure of arrays.
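To make the structure-of-arrays idea concrete, here is a small sketch, 
assuming core.simd is available on the target. The names PointAoS, 
PointBlock4 and squaredLengths are hypothetical and chosen only for 
illustration. Four points are stored with one vector per coordinate, 
so each lane carries its own point and no data has to move between 
lanes.

```d
import core.simd;

// Array-of-structures layout: x, y, z of a single point are adjacent,
// so one SIMD register would hold a mix of different coordinates.
struct PointAoS { float x, y, z; }

// Structure-of-arrays layout (hypothetical type): four points per block,
// one vector per coordinate. Lane i of x, y and z belongs to point i.
struct PointBlock4
{
    float4 x, y, z;
}

// Squared lengths of four points at once. Every lane is independent,
// so only element-wise multiplies and adds are needed.
float4 squaredLengths(PointBlock4 p)
{
    return p.x * p.x + p.y * p.y + p.z * p.z;
}

unittest
{
    float4 x = [1.0f, 0.0f, 2.0f, 3.0f];
    float4 y = [2.0f, 0.0f, 0.0f, 4.0f];
    float4 z = [2.0f, 1.0f, 0.0f, 0.0f];
    auto p = PointBlock4(x, y, z);
    float[4] r = squaredLengths(p).array;
    assert(r == [9.0f, 1.0f, 4.0f, 25.0f]);
}
```

The same layout turns four dot products into three multiplies and two 
adds with no horizontal step, which is the kind of restructuring that 
removes the swizzling cost.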

Per-element conditional branches are very bad, and often 
completely defeat the benefits of SIMD. For very short segments 
of code (like conditional assignment), replace them with a SIMD 
conditional move (a packed compare such as vcmpps followed by a 
blend such as vblendvps). Bit-twiddling is your friend.
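A minimal sketch of the bit-twiddling form of a conditional move, 
using only XOR and AND on integer vectors. The function name select 
is hypothetical, and the mask is assumed to have each lane either all 
ones or all zeros, which is what a packed compare produces. Here the 
mask is written by hand rather than computed, since compare support 
on vector types varies between D compilers.

```d
import core.simd;

// Branchless per-lane select: where a mask lane is all ones the result
// takes the lane from a, where it is all zeros it takes the lane from b.
// Assumption: every mask lane is either -1 (all bits set) or 0.
int4 select(int4 mask, int4 a, int4 b)
{
    // (a ^ b) & mask keeps the differing bits only in selected lanes;
    // XOR-ing that into b flips those lanes from b to a.
    return b ^ ((a ^ b) & mask);
}

unittest
{
    int4 mask = [-1, 0, -1, 0];
    int4 a = [1, 2, 3, 4];
    int4 b = [10, 20, 30, 40];
    int[4] r = select(mask, a, b).array;
    assert(r == [1, 20, 3, 40]);
}
```

Both sides of the condition are computed for every lane and the mask 
picks the result, which is why this only pays off for very short 
branches.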

Finally, do not trust the compiler or the optimizer. People love 
to make the claim that "The Compiler" is always better than 
humans at micro-optimizations, but this is not at all the case 
for SIMD code with current systems. I have found even LLVM to 
produce quite bad SIMD code for complex algorithms, unless I 
carefully structure my code to make it as easy as possible for 
the optimizer to get to the final assembly I want. A sprinkling 
of manual assembly code (directly, or via a library) is also 
necessary to fill in certain instructions that the compiler 
doesn't know when to use at all.

Resources I have found very helpful:

Matt Godbolt's Compiler Explorer online visual disassembler 
(supports D):
     https://godbolt.org/

Felix Cloutier's x86 and amd64 instruction reference:
     https://www.felixcloutier.com/x86/

Agner Fog's optimization guide (especially the instruction 
tables):
     https://agner.org/optimize/