Optimizing for SIMD: best practices? (i.e. what features are allowed?)
Bruce Carneal
bcarneal at gmail.com
Sun Mar 7 20:03:43 UTC 2021
On Sunday, 7 March 2021 at 14:15:58 UTC, z wrote:
> On Thursday, 25 February 2021 at 14:28:40 UTC, Guillaume Piolat
> wrote:
>> On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
>>> How does one optimize code to make full use of the CPU's SIMD
>>> capabilities?
>>> Is there any way to guarantee that "packed" versions of SIMD
>>> instructions will be used?(e.g. vmulps, vsqrtps, etc...)
>>
>> https://code.dlang.org/packages/intel-intrinsics
>
> I'd try to use it but the platform i'm building on requires AVX
> to get the most performance.
The code below might be worth a try on your AVX512 machine.
Unless you want a combined figure (compute plus memory access),
you may need to separate out the memory access overhead by running
multiple passes over a data set sized to stay resident in L2.
Also note that I compiled with -preview=in. I don't know if that
matters.
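
import std.math : sqrt;

enum SIMDBits = 512; // 256 was tested, 512 was not
alias A = float[SIMDBits / (float.sizeof * 8)]; // 16 floats at 512 bits

pragma(inline, true)
void soaEuclidean(ref A a0, in A a1, in A a2, in A a3, in A b1, in A b2, in A b3)
{
    alias V = __vector(A);

    // Per-lane sqrt via a round trip through the array type;
    // static foreach unrolls the loop at compile time.
    static V vsqrt(V v)
    {
        A a = cast(A) v;
        static foreach (i; 0 .. A.length)
            a[i] = sqrt(a[i]);
        return cast(V)a;
    }

    // Per-lane squared difference (b - a)^2.
    static V sd(in A a, in A b)
    {
        V v = cast(V) b - cast(V) a;
        return v * v;
    }

    // Sum the squared differences of the three coordinates, then take the root.
    auto v = sd(a1, b1) + sd(a2, b2) + sd(a3, b3);
    a0[] = vsqrt(v)[];
}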
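
The multi-pass idea above could be sketched roughly as follows. This is
only an illustration, not a tuned benchmark: refEuclidean is a
hypothetical scalar stand-in with the same parameter list as
soaEuclidean (swap the real kernel in to time it), lanes = 8 matches the
256-bit width that was tested, and the 128 KiB working-set budget is an
assumption that should be adjusted to the L2 size of the target CPU.
Data, makeData and runPasses are likewise made-up helper names.

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.math : sqrt;
import std.stdio : writefln;

enum lanes = 8;                 // 256 bits of float (assumption, see above)
alias A = float[lanes];

// Hypothetical scalar stand-in for the kernel under test.
void refEuclidean(ref A a0, in A a1, in A a2, in A a3,
                  in A b1, in A b2, in A b3)
{
    foreach (i; 0 .. lanes)
    {
        immutable d1 = b1[i] - a1[i];
        immutable d2 = b2[i] - a2[i];
        immutable d3 = b3[i] - a3[i];
        a0[i] = sqrt(d1 * d1 + d2 * d2 + d3 * d3);
    }
}

// Assumed working-set budget; all seven arrays must fit in it together.
enum l2Bytes = 128 * 1024;
enum blocks = l2Bytes / (7 * A.sizeof);

// Structure-of-arrays data set: one result array and six input arrays.
struct Data
{
    A[] r, a1, a2, a3, b1, b2, b3;
}

Data makeData(size_t n)
{
    Data d;
    d.r = new A[n];
    d.a1 = new A[n]; d.a2 = new A[n]; d.a3 = new A[n];
    d.b1 = new A[n]; d.b2 = new A[n]; d.b3 = new A[n];
    foreach (i; 0 .. n)
    {
        // Points (0,0,0) and (1,2,2): every distance is sqrt(9) = 3.
        d.a1[i][] = 0f; d.a2[i][] = 0f; d.a3[i][] = 0f;
        d.b1[i][] = 1f; d.b2[i][] = 2f; d.b3[i][] = 2f;
    }
    return d;
}

// Repeatedly sweep the same data so later passes are served from cache.
void runPasses(ref Data d, size_t passes)
{
    foreach (p; 0 .. passes)
        foreach (i; 0 .. d.r.length)
            refEuclidean(d.r[i], d.a1[i], d.a2[i], d.a3[i],
                         d.b1[i], d.b2[i], d.b3[i]);
}

void main()
{
    auto d = makeData(blocks);
    runPasses(d, 1);                    // warm-up pass, not timed
    auto sw = StopWatch(AutoStart.yes);
    runPasses(d, 1000);
    sw.stop();
    writefln("%s blocks x 1000 passes took %s", blocks, sw.peek);
}
```

Comparing the time per pass here against a single pass over a data set
much larger than the cache should give a rough idea of how much of the
total is memory access rather than arithmetic. An optimizer may remove
work whose results are never read, so it is worth checking the output
array (or the generated assembly) before trusting the numbers.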