Optimizing for SIMD: best practices? (i.e. what features are allowed?)

Sun Mar 7 20:03:43 UTC 2021

On Sunday, 7 March 2021 at 14:15:58 UTC, z wrote:
> On Thursday, 25 February 2021 at 14:28:40 UTC, Guillaume Piolat 
> wrote:
>> On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
>>> How does one optimize code to make full use of the CPU's SIMD 
>>> capabilities?
>>> Is there any way to guarantee that "packed" versions of SIMD 
>>> instructions will be used? (e.g. vmulps, vsqrtps, etc.)
>>
>> https://code.dlang.org/packages/intel-intrinsics
>
> I'd try to use it, but the platform I'm building on requires AVX
> to get the most performance.

The code below might be worth a try on your AVX512 machine.

Unless you want a single combined figure (compute plus memory
access), you may need to separate out the memory access overhead
by running multiple passes over a data set sized to fit in L2.
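
As a rough illustration of the multi-pass idea, here is a minimal timing sketch. The 128 KiB budget, the `distances` stand-in kernel, and the pass count are all assumptions made for the example, not figures from this thread; swap in the real kernel and the actual L2 size of your CPU.

```d
import std.datetime.stopwatch : StopWatch, AutoStart;
import std.math : sqrt;
import std.stdio : writefln;

// Assumption: 128 KiB sits comfortably inside a per-core L2; adjust as needed.
enum budgetBytes = 128 * 1024;
// Six input streams plus one output stream share the budget;
// the element count is rounded down to a multiple of 16 floats.
enum n = budgetBytes / (7 * float.sizeof) / 16 * 16;

// Hypothetical scalar stand-in for the kernel being measured.
void distances(float[] d, const float[][3] a, const float[][3] b)
{
    foreach (i; 0 .. d.length)
    {
        float s = 0;
        static foreach (k; 0 .. 3)
        {{
            immutable t = b[k][i] - a[k][i];
            s += t * t;
        }}
        d[i] = sqrt(s);
    }
}

void main()
{
    auto d = new float[n];
    float[][3] a, b;
    foreach (k; 0 .. 3)
    {
        a[k] = new float[n];
        b[k] = new float[n];
        a[k][] = 1.0f;
        b[k][] = 3.0f;
    }

    distances(d, a, b); // warm-up pass so later passes start from a warm cache

    enum passes = 1000;
    auto sw = StopWatch(AutoStart.yes);
    foreach (_; 0 .. passes)
        distances(d, a, b);
    sw.stop();

    immutable nsPerElem = cast(double) sw.peek.total!"nsecs" / (passes * n);
    // Printing an output element keeps the result observably used.
    writefln("%.3f ns per element, d[0] = %s", nsPerElem, d[0]);
}
```

Comparing this number against a single cold pass over a much larger data set gives a rough split between compute time and memory access time. An aggressive optimizer may still simplify repeated identical passes, so it is worth checking the generated code.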

Also note that I compiled with -preview=in.  I don't know if that 
matters.

import std.math : sqrt;
enum SIMDBits = 512; // 256 was tested, 512 was not
alias A = float[SIMDBits / (float.sizeof * 8)];
pragma(inline, true)
void soaEuclidean(ref A a0, in A a1, in A a2, in A a3,
                  in A b1, in A b2, in A b3)
{
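     // Hardware vector type with the same element type and length as A.
     alias V = __vector(A);

     // Element-wise square root. The static foreach unrolls over every
     // lane; the hope is that the optimizer turns this into a single
     // packed sqrt (e.g. vsqrtps).
     static V vsqrt(V v)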
     {
         A a = cast(A) v;
         static foreach (i; 0 .. A.length)
             a[i] = sqrt(a[i]);
         return cast(V)a;
     }

     // Squared difference per lane: (b - a) * (b - a).
     static V sd(in A a, in A b)
     {
         V v = cast(V) b - cast(V) a;
         return v * v;
     }

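     // Sum the squared differences over the three coordinates, then
     // take the square root of each lane to get the distances.
     auto v = sd(a1, b1) + sd(a2, b2) + sd(a3, b3);
     a0[] = vsqrt(v)[];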
}
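
To check that the vector version produces the right numbers, a plain scalar reference with the same signature can help. The sketch below is not from the post: `soaEuclideanRef` is a hypothetical name, and `SIMDBits` is set to 256 here only to keep the example small.

```d
import std.math : sqrt, isClose;

enum SIMDBits = 256; // assumption for this example: eight floats per block
alias A = float[SIMDBits / (float.sizeof * 8)];

// Hypothetical scalar reference: same inputs and outputs as soaEuclidean,
// computed one lane at a time with no vector types involved.
void soaEuclideanRef(ref A a0, in A a1, in A a2, in A a3,
                     in A b1, in A b2, in A b3)
{
    foreach (i; 0 .. A.length)
    {
        immutable d1 = b1[i] - a1[i];
        immutable d2 = b2[i] - a2[i];
        immutable d3 = b3[i] - a3[i];
        a0[i] = sqrt(d1 * d1 + d2 * d2 + d3 * d3);
    }
}

void main()
{
    A a1 = 0, a2 = 0, a3 = 0; // every lane holds the point (0, 0, 0)
    A b1 = 1, b2 = 2, b3 = 2; // every lane holds the point (1, 2, 2)
    A expected;
    soaEuclideanRef(expected, a1, a2, a3, b1, b2, b3);
    foreach (x; expected)
        assert(isClose(x, 3.0f)); // sqrt(1 + 4 + 4) = 3
}
```

Calling soaEuclidean with the same seven arguments should fill its output with matching values, up to rounding differences.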