Optimizing for SIMD: best practices? (i.e. what features are allowed?)

Fri Feb 26 03:57:12 UTC 2021

On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
>>float euclideanDistanceFixedSizeArray(float[3] a, float[3] b) {

Use __vector(float[4]), not float[3].

>>   float distance;

The default value for float is float.nan. You need to explicitly 
initialize it to 0.0f or something if you want this function to 
actually do anything useful.
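
A quick way to see the problem, as a minimal sketch (the variable names are made up for illustration): an uninitialized float starts as NaN, and NaN survives every addition, so the accumulated sum is NaN too.

```d
import std.math : isNaN;

void main()
{
    float uninitialized;     // default-initialized to float.nan in D
    float zeroed = 0.0f;     // explicit initialization

    assert(isNaN(uninitialized));

    uninitialized += 1.0f;   // NaN + 1 is still NaN
    zeroed += 1.0f;

    assert(isNaN(uninitialized));
    assert(zeroed == 1.0f);
}
```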

>>   a[] -= b[];
>>   a[] *= a[];

With __vector types, this can be simplified (not optimized) to 
just:
     a -= b;
     a *= a;

>>   static foreach(size_t i; 0 .. 3/+typeof(a).length+/){
>>       distance += a[i].abs;//abs required by the caller

(a * a) above is never negative for real numbers. You don't need 
the call to abs unless you're trying to guarantee that even nan 
values will have a clear sign bit.
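
As a small illustration (the sample values are arbitrary, and this only covers non-nan inputs), the sign bit of a square is clear even for negative inputs and for negative zero:

```d
import std.math : signbit;

void main()
{
    // Squaring any non-nan float yields a value with a clear sign bit,
    // including -0.0f * -0.0f, which gives +0.0f.
    foreach (x; [-3.0f, -0.0f, 0.0f, 2.5f])
        assert(!signbit(x * x));
}
```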

Also, there is no point in adding the first component to zero, 
and copying element [0] from a SIMD register into a scalar is 
free, so this can become:

     float distance = a[0];
     static foreach(size_t i; 1 .. 3)
         distance += a[i];

>>   }
>>   return sqrt(distance);
>>}
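
Putting the pieces together, the whole function might look like the sketch below. It assumes LDC on x86-64, where core.simd's float4 is an alias for __vector(float[4]) and vector elements can be indexed directly. The function name and the test points are invented for illustration, and the caller is assumed to pad each 3-component point to four lanes; the fourth lane is ignored by the sum.

```d
import core.simd : float4;   // alias for __vector(float[4])
import std.math : sqrt;

float euclideanDistance(float4 a, float4 b)
{
    a -= b;                  // per-lane difference
    a *= a;                  // per-lane square

    // Start from lane 0 instead of adding it to zero,
    // then add lanes 1 and 2; lane 3 is padding.
    float distance = a[0];
    static foreach (size_t i; 1 .. 3)
        distance += a[i];

    return sqrt(distance);
}

void main()
{
    // Differences are (3, 4, 0), so the distance is sqrt(9 + 16) = 5.
    float4 p = [1.0f, 2.0f, 3.0f, 0.0f];
    float4 q = [4.0f, 6.0f, 3.0f, 7.0f];   // lane 3 differs but is not summed
    assert(euclideanDistance(p, q) == 5.0f);
}
```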

Final assembly output (ldc 1.24.0 with -release -O3 
-preview=intpromote -preview=dip1000 -m64 -mcpu=haswell 
-fp-contract=fast -enable-cross-module-inlining):

     vsubps  xmm0, xmm1, xmm0
     vmulps  xmm0, xmm0, xmm0
     vmovshdup       xmm1, xmm0
     vaddss  xmm1, xmm0, xmm1
     vpermilpd       xmm0, xmm0, 1
     vaddss  xmm0, xmm0, xmm1
     vsqrtss xmm0, xmm0, xmm0
     ret