Optimizing for SIMD: best practices? (i.e. what features are allowed?)

Bruce Carneal bcarneal at gmail.com
Fri Feb 26 05:50:36 UTC 2021


On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
> How does one optimize code to make full use of the CPU's SIMD 
> capabilities?
> Is there any way to guarantee that "packed" versions of SIMD 
> instructions will be used? (e.g. vmulps, vsqrtps, etc.)
> To give some context, this is a sample of one of the functions 
> that could benefit from better SIMD usage :
>>import std.math : abs, sqrt;
>>
>>float euclideanDistanceFixedSizeArray(float[3] a, float[3] b) {
>>   float distance = 0; // float.init is NaN; the accumulator must start at zero
>>   a[] -= b[];
>>   a[] *= a[];
>>   static foreach(size_t i; 0 .. 3/+typeof(a).length+/){
>>       distance += a[i].abs; // abs required by the caller
>>   }
>>   return sqrt(distance);
>>}
>>vmovsd xmm0,qword ptr ds:[rdx]
>>vmovss xmm1,dword ptr ds:[rdx+8]
>>vmovsd xmm2,qword ptr ds:[rcx+4]
>>vsubps xmm0,xmm0,xmm2
>>vsubss xmm1,xmm1,dword ptr ds:[rcx+C]
>>vmulps xmm0,xmm0,xmm0
>>vmulss xmm1,xmm1,xmm1
>>vbroadcastss xmm2,dword ptr ds:[<__real at 7fffffff>]
>>vandps xmm0,xmm0,xmm2
>>vpermilps xmm3,xmm0,F5
>>vaddss xmm0,xmm0,xmm3
>>vandps xmm1,xmm1,xmm2
>>vaddss xmm0,xmm0,xmm1
>>vsqrtss xmm0,xmm0,xmm0
>>vmovaps xmm6,xmmword ptr ss:[rsp+20]
>>add rsp,38
>>ret
>
> I've tried experimenting with dynamic arrays of float[3], but 
> the output assembly seemed worse [1] (in short, it calls 
> internal D functions that use scalar "vxxxss" instructions 
> while performing many moves).
>
> Big thanks
> [1] https://run.dlang.io/is/F3Xye3

If you are developing for deployment to a platform that has a 
GPU, you might consider going SIMT (dcompute) rather than SIMD.  
SIMT is a lot easier on the eyes.  More importantly, if you're 
targeting an SoC or console, or have relatively chunky 
computations that allow you to work around the PCIe transit 
costs, the path is open to very large performance improvements.  
I've only been using dcompute for a week or so but so far it's 
been great.

If your algorithms are very branchy, or you decide to stick with 
multi-core/SIMD for any of a number of other good reasons, here 
are a few things I learned before decamping to dcompute land that 
may help:

   1)  LDC is pretty good at auto-vectorization, as you have 
probably observed.  It is definitely worth a few iterations to 
try to get the vectorizer engaged.
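
   A minimal sketch of the kind of loop the vectorizer likes, 
assuming a build along the lines of "ldc2 -O3 -mcpu=native"; 
note that float reductions like this one generally only 
vectorize with -ffast-math, since they require reassociation:

    import std.math : sqrt;

    float sumOfSquares(const(float)[] xs) @safe @nogc nothrow pure
    {
        float acc = 0;
        foreach (x; xs)       // simple reduction over contiguous data;
            acc += x * x;     // LDC can lower this to vmulps/vaddps
        return sqrt(acc);
    }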

   2)  LDC auto-vectorization was good, but explicit __vector 
programming is more predictable and was, at least for my tasks, 
much faster.  I couldn't persuade the auto-vectorizer to "do the 
right thing" throughout the hot path, but perhaps you'll have 
better luck.
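
   For reference, a minimal sketch of the explicit style using 
core.simd (float4 is an alias for __vector(float[4]) and needs 
a target with at least SSE):

    import core.simd;

    // Arithmetic on vector types maps directly to packed
    // instructions, e.g. vmulps/vaddps (or a fused vfmadd*
    // when FMA is available and enabled).
    float4 mulAdd(float4 a, float4 b, float4 c)
    {
        return a * b + c;
    }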

   3)  LDC does a good job of going between T[N] <==> 
__vector(T[N]), so using the static array types as your 
input/output types and the __vector types as your compute types 
works out well whenever you have to interface with an unaligned 
world.  LDC issues unaligned vector loads/stores for casts or 
full array assigns (v = cast(VT)sa[]; or v[] = sa[];), and these 
are quite good on modern CPUs.  For calibration: Ethan recently 
talked about a 10% gain he saw from full alignment, IIRC, so 
there's that.
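
   A sketch of that boundary pattern for float[4] <==> float4; 
the language allows these casts between static arrays and 
vectors of matching layout, and LDC compiles them to unaligned 
moves such as vmovups:

    import core.simd;

    // Static arrays at the (possibly unaligned) interface,
    // vector types for the actual computation.
    float4 load(ref float[4] sa)
    {
        return cast(float4) sa;    // unaligned packed load
    }

    void store(ref float[4] sa, float4 v)
    {
        sa = cast(float[4]) v;     // unaligned packed store
    }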

   4)  LDC also does a good job of discovering SIMD equivalents 
given static foreach unrolled loops with explicit compile-time 
indexing of vector element operands.  You can use those along 
with pragma(inline, true) to develop your own "intrinsics" that 
supplement other libs.
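
   A sketch of such a hand-rolled "intrinsic", here a per-lane 
abs; the static foreach with compile-time lane indices is the 
pattern LDC recognizes, typically lowering it to a single 
vandps:

    import core.simd;
    import std.math : abs;

    pragma(inline, true)
    float4 vabs(float4 v)
    {
        float4 r;
        static foreach (i; 0 .. 4)
            r.array[i] = abs(v.array[i]);  // explicit compile-time indexing
        return r;
    }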

   5)  If you adopt the __vector approach you'll have to handle 
the partials manually (array length % vector length != 0 
indicates a partial, or tail, fragment).  If the classic 
copying/padding approaches to such fragmentation don't work for 
you, I'd suggest using nested static functions that take ref 
T[N] inputs and outputs.  The main loop becomes very simple, and 
the tail handling reduces to loading stack allocated T[N] 
variables explicitly, calling the static function, and 
unloading.
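
   A sketch of that shape, with illustrative names: the kernel 
is a nested static function over ref float[4], the main loop 
feeds it full chunks directly, and the zero-padded tail goes 
through the same kernel via a stack temporary:

    import core.simd;

    float sumSquares(float[] xs)
    {
        // One kernel serves both the main loop and the tail.
        static void step(ref float[4] chunk, ref float[4] acc)
        {
            float4 v = cast(float4) chunk;  // unaligned vector load
            float4 a = cast(float4) acc;
            a += v * v;
            acc = cast(float[4]) a;         // store back
        }

        float[4] acc = 0;
        size_t i = 0;
        for (; i + 4 <= xs.length; i += 4)
            step(xs[i .. i + 4][0 .. 4], acc);  // full chunks in place

        float[4] tail = 0;                      // zero-padded stack temp
        tail[0 .. xs.length - i] = xs[i .. $];  // load the partial
        step(tail, acc);                        // same kernel for the tail

        return acc[0] + acc[1] + acc[2] + acc[3];
    }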

Good luck.



