Optimizing for SIMD: best practices? (i.e. what features are allowed?)
tsbockman
thomas.bockman at gmail.com
Fri Feb 26 03:38:51 UTC 2021
On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
> Is there any way to guarantee that "packed" versions of SIMD
> instructions will be used? (e.g. vmulps, vsqrtps, etc.)
> To give some context, this is a sample of one of the functions
> that could benefit from better SIMD usage :
> float euclideanDistanceFixedSizeArray(float[3] a, float[3] b) {
You need to use __vector(float[4]) instead of float[3] to tell
the compiler to pack multiple elements per SIMD register. Right
now, your data lacks the alignment guarantees needed for
efficient SIMD loads and stores.
Beyond that, SIMD code is rather difficult to optimize. Code
written in ignorance or in a rush is unlikely to be meaningfully
faster than ordinary scalar code, unless the data flow is very
simple. You will probably get a bigger speedup for less effort
and pain by first minimizing heap allocations, maximizing
locality of reference, minimizing indirections, and minimizing
memory use. (And, of course, it should go without saying that
choosing an asymptotically efficient high-level algorithm is more
important than any micro-optimization for large data sets.)
Nevertheless, if you are up to the challenge, SIMD can sometimes
provide a final 2-3x speed boost.
Your algorithms will need to be designed to minimize mixing of
data between SIMD channels, as this forces the generation of lots
of extra instructions to swizzle the data, or worse to unpack and
repack it. Something like a Cartesian dot product or cross
product will benefit much less from SIMD than vector addition,
for example. Sometimes the amount of swizzling can be greatly
reduced with a little algebra, other times you might need to
refactor an array of structures into a structure of arrays.
Per-element conditional branches are very bad, and often
completely defeat the benefits of SIMD. For very short segments
of code (like conditional assignment), replace them with a SIMD
conditional move (vcmp and vblend). Bit-twiddling is your friend.
Finally, do not trust the compiler or the optimizer. People love
to make the claim that "The Compiler" is always better than
humans at micro-optimizations, but this is not at all the case
for SIMD code with current systems. I have found even LLVM to
produce quite bad SIMD code for complex algorithms, unless I
carefully structure my code to make it as easy as possible for
the optimizer to get to the final assembly I want. A sprinkling
of manual assembly code (directly, or via a library) is also
necessary to fill in certain instructions that the compiler
doesn't know when to use at all.
Resources I have found very helpful:
Matt Godbolt's Compiler Explorer online visual disassembler
(supports D):
https://godbolt.org/
Felix Cloutier's x86 and amd64 instruction reference:
https://www.felixcloutier.com/x86/
Agner Fog's optimization guide (especially the instruction
tables):
https://agner.org/optimize/
More information about the Digitalmars-d-learn
mailing list