Optimizing for SIMD: best practices? (i.e. what features are allowed?)
z
z at z.com
Thu Feb 25 11:28:14 UTC 2021
How does one optimize code to make full use of the CPU's SIMD
capabilities?
Is there any way to guarantee that "packed" versions of SIMD
instructions will be used? (e.g. vmulps, vsqrtps, etc.)
To give some context, this is a sample of one of the functions
that could benefit from better SIMD usage:
>import std.math : abs, sqrt;
>
>float euclideanDistanceFixedSizeArray(float[3] a, float[3] b) {
>    float distance = 0; // float.init is NaN in D, so initialize explicitly
>    a[] -= b[];
>    a[] *= a[];
>    static foreach (size_t i; 0 .. 3 /+ typeof(a).length +/) {
>        distance += a[i].abs; // abs required by the caller
>    }
>    return sqrt(distance);
>}
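One way to get packed instructions deterministically, rather than hoping the optimizer finds them, is to use core.simd vector types directly. A rough, untested sketch (the function name is mine, and it assumes an x86_64 target where float4 maps to an SSE register):

```d
import core.simd;
import std.math : sqrt;

float euclideanDistanceSimd(const float[3] a, const float[3] b)
{
    // Widen the 3-element arrays to 4 lanes; the extra lane is zero,
    // so it contributes nothing to the sum.
    float4 va = [a[0], a[1], a[2], 0.0f];
    float4 vb = [b[0], b[1], b[2], 0.0f];

    float4 d = va - vb; // one packed subtract (vsubps)
    d *= d;             // one packed multiply (vmulps)

    // Horizontal sum of the lanes; the squares are already
    // non-negative, so no abs is needed here.
    auto s = d.array;
    return sqrt(s[0] + s[1] + s[2] + s[3]);
}
```

The horizontal sum at the end is still scalar; for a single 3-vector that is hard to avoid, which is part of why such short inputs see limited benefit from packing.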
>vmovsd xmm0,qword ptr ds:[rdx]
>vmovss xmm1,dword ptr ds:[rdx+8]
>vmovsd xmm2,qword ptr ds:[rcx+4]
>vsubps xmm0,xmm0,xmm2
>vsubss xmm1,xmm1,dword ptr ds:[rcx+C]
>vmulps xmm0,xmm0,xmm0
>vmulss xmm1,xmm1,xmm1
>vbroadcastss xmm2,dword ptr ds:[<__real at 7fffffff>]
>vandps xmm0,xmm0,xmm2
>vpermilps xmm3,xmm0,F5
>vaddss xmm0,xmm0,xmm3
>vandps xmm1,xmm1,xmm2
>vaddss xmm0,xmm0,xmm1
>vsqrtss xmm0,xmm0,xmm0
>vmovaps xmm6,xmmword ptr ss:[rsp+20]
>add rsp,38
>ret
I've tried experimenting with dynamic arrays instead of float[3], but
the output assembly seemed worse [1] (in short, it calls internal D
functions that use scalar "vxxxss" instructions while performing many
moves).
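If the real workload computes many distances, another angle (independent of dynamic vs. static arrays) is a structure-of-arrays layout: keeping x, y and z components in separate slices gives the optimizer contiguous streams it can turn into full-width packed vsubps/vmulps/vsqrtps. A hedged sketch, my own naming, assuming all slices share the same length:

```d
import std.math : sqrt;

// Structure-of-arrays: component slices (ax, ay, az) and (bx, by, bz)
// instead of an array of float[3] points. The straight-line inner loop
// is a good candidate for auto-vectorization into packed instructions.
void euclideanDistances(const float[] ax, const float[] ay, const float[] az,
                        const float[] bx, const float[] by, const float[] bz,
                        float[] result)
{
    foreach (i; 0 .. result.length)
    {
        immutable dx = ax[i] - bx[i];
        immutable dy = ay[i] - by[i];
        immutable dz = az[i] - bz[i];
        result[i] = sqrt(dx * dx + dy * dy + dz * dz);
    }
}
```

Whether this actually vectorizes depends on the compiler and flags (e.g. LDC with optimizations and a target CPU that has the instructions), so checking the emitted assembly is still necessary.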
Big thanks
[1] https://run.dlang.io/is/F3Xye3
More information about the Digitalmars-d-learn
mailing list