Optimizing for SIMD: best practices? (i.e. what features are allowed?)

z z at z.com
Thu Feb 25 11:28:14 UTC 2021


How does one optimize code to make full use of the CPU's SIMD 
capabilities?
Is there any way to guarantee that "packed" versions of SIMD 
instructions will be used? (e.g. vmulps, vsqrtps, etc.)
To give some context, here is a sample of one of the functions 
that could benefit from better SIMD usage:
>float euclideanDistanceFixedSizeArray(float[3] a, float[3] b) {
>   import std.math : abs, sqrt;
>   float distance = 0; // float.init is NaN, so it must be zeroed explicitly
>   a[] -= b[];
>   a[] *= a[];
>   static foreach (size_t i; 0 .. 3 /+typeof(a).length+/) {
>       distance += a[i].abs; // abs required by the caller
>   }
>   return sqrt(distance);
>}
>vmovsd xmm0,qword ptr ds:[rdx]
>vmovss xmm1,dword ptr ds:[rdx+8]
>vmovsd xmm2,qword ptr ds:[rcx+4]
>vsubps xmm0,xmm0,xmm2
>vsubss xmm1,xmm1,dword ptr ds:[rcx+C]
>vmulps xmm0,xmm0,xmm0
>vmulss xmm1,xmm1,xmm1
>vbroadcastss xmm2,dword ptr ds:[<__real at 7fffffff>]
>vandps xmm0,xmm0,xmm2
>vpermilps xmm3,xmm0,F5
>vaddss xmm0,xmm0,xmm3
>vandps xmm1,xmm1,xmm2
>vaddss xmm0,xmm0,xmm1
>vsqrtss xmm0,xmm0,xmm0
>vmovaps xmm6,xmmword ptr ss:[rsp+20]
>add rsp,38
>ret
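One approach I'm considering is core.simd's float4, which maps directly to an XMM register, so arithmetic on it should be packed by construction rather than relying on autovectorization. A minimal sketch (the function name is mine, and I'm assuming DMD/LDC on x86_64):

```d
import core.simd;
import std.math : sqrt;

float euclideanDistanceSimd(const float[3] a, const float[3] b)
{
    // Pad the 3-element inputs into 4-lane vectors; the 4th lane is
    // zero so it contributes nothing to the sum.
    float4 va = [a[0], a[1], a[2], 0.0f];
    float4 vb = [b[0], b[1], b[2], 0.0f];
    float4 d = va - vb; // should lower to vsubps
    d *= d;             // should lower to vmulps
    // Horizontal sum of the three squared components (scalar tail);
    // the squares are non-negative, so no abs is needed here.
    return sqrt(d.array[0] + d.array[1] + d.array[2]);
}
```

I haven't benchmarked this; it only illustrates that explicit float4 arithmetic can't be silently split back into scalar "vxxxss" ops the way the autovectorized version can.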

I've also experimented with dynamic arrays in place of float[3], but 
the output assembly seemed to be worse [1] (in short, it calls 
internal D functions that use scalar "vxxxss" instructions while 
performing many moves).

Big thanks
[1] https://run.dlang.io/is/F3Xye3


More information about the Digitalmars-d-learn mailing list