Optimizing for SIMD: best practices? (i.e. what features are allowed?)
tsbockman
thomas.bockman at gmail.com
Sun Mar 7 22:54:32 UTC 2021
On Sunday, 7 March 2021 at 13:26:37 UTC, z wrote:
> On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
> However, AVX512 support seems limited to being able to use the
> 16 other YMM registers, rather than using the same code
> template but changed to use ZMM registers and double the
> offsets to take advantage of the new size.
> Compiled with «-g -enable-unsafe-fp-math
> -enable-no-infs-fp-math -ffast-math -O -release -mcpu=skylake» :
You're not compiling with AVX512 enabled. You would need to use
-mcpu=skylake-avx512.
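One way to check which instruction sets a given -mcpu setting actually enabled is to ask the compiler at build time. This is only a sketch: it relies on LDC's __traits(targetHasFeature, ...) extension with LLVM's feature names, and hasAVX2, hasAVX512F and floatLanes are hypothetical names made up for this illustration.

```d
// Sketch only. __traits(targetHasFeature, ...) is an LDC extension; the
// feature strings ("avx2", "avx512f") are LLVM's names for those features.
version(LDC)
{
    enum hasAVX2    = __traits(targetHasFeature, "avx2");
    enum hasAVX512F = __traits(targetHasFeature, "avx512f");
}
else
{
    // Assumption for this sketch: report nothing on other compilers.
    enum hasAVX2    = false;
    enum hasAVX512F = false;
}

// Hypothetical helper: float lanes per vector for the enabled features.
static if(hasAVX512F)
    enum size_t floatLanes = 16; // ZMM, 512 bits
else static if(hasAVX2)
    enum size_t floatLanes = 8;  // YMM, 256 bits
else
    enum size_t floatLanes = 4;  // XMM, 128 bits

// Printed while compiling, so the effect of -mcpu is visible immediately.
pragma(msg, "AVX2: ", hasAVX2, ", AVX512F: ", hasAVX512F,
    ", float lanes: ", floatLanes);

void main() {}
```

With -mcpu=skylake this should report AVX2 but not AVX512F; with -mcpu=skylake-avx512 it should report both.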
However, LLVM's code generation for AVX512 still seems to be pretty
poor, so you'll need to either use some inline ASM or stick with
AVX2. Here's a structure-of-arrays style example:
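import std.meta : Repeat;

// a and b each expand to three vectors (x, y, z); each lane is one point.
void euclideanDistanceFixedSizeArray(V)(ref Repeat!(3, const(V)) a,
    ref Repeat!(3, const(V)) b, out V result)
    if(is(V : __vector(float[length]), size_t length))
{
    // Per-coordinate squared differences.
    Repeat!(3, V) diffSq = a;
    static foreach(i; 0 .. 3) {
        diffSq[i] -= b[i];
        diffSq[i] *= diffSq[i];
    }

    // Sum the squares. Start at 1 because diffSq[0] is already in result;
    // starting at 0 would count it twice (the listing below was built with
    // 0 .. 3, hence its vaddps zmm0, zmm0, zmm0).
    result = diffSq[0];
    static foreach(i; 1 .. 3)
        result += diffSq[i];

    // Vector square root through LDC inline asm; x86-64 only.
    version(LDC) { version(X86_64) {
        enum isSupportedPlatform = true;
        import ldc.llvmasm : __asm;
        result = __asm!V(`vsqrtps $1, $0`, `=x, x`, result);
    } }
    // Fails to compile on any other compiler or architecture.
    static assert(isSupportedPlatform);
}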
Resulting asm with is(V == __vector(float[16])):
.LCPI1_0:
        .long   0x7fc00000
pure nothrow @nogc void app.euclideanDistanceFixedSizeArray!(__vector(float[16])).euclideanDistanceFixedSizeArray(ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), out __vector(float[16])):
        mov     rax, qword ptr [rsp + 8]
        vbroadcastss    zmm0, dword ptr [rip + .LCPI1_0]
        vmovaps zmmword ptr [rdi], zmm0
        vmovaps zmm0, zmmword ptr [rax]
        vmovaps zmm1, zmmword ptr [r9]
        vmovaps zmm2, zmmword ptr [r8]
        vsubps  zmm0, zmm0, zmmword ptr [rcx]
        vmulps  zmm0, zmm0, zmm0
        vsubps  zmm1, zmm1, zmmword ptr [rdx]
        vsubps  zmm2, zmm2, zmmword ptr [rsi]
        vaddps  zmm0, zmm0, zmm0
        vfmadd231ps     zmm0, zmm1, zmm1
        vfmadd231ps     zmm0, zmm2, zmm2
        vmovaps zmmword ptr [rdi], zmm0
        vsqrtps zmm0, zmm0
        vmovaps zmmword ptr [rdi], zmm0
        vzeroupper
        ret
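For reference, a call site for a function like this might look as follows. This is a sketch under a few assumptions: LDC on x86-64 with AVX enabled (for example -mcpu=skylake), 8-lane vectors rather than 16 so that AVX512 is not required, and made-up test values. The function is repeated so the block compiles on its own, with the accumulation loop running over 1 .. 3 so that the first squared difference is counted once.

```d
// Sketch only; needs LDC on x86-64 with AVX enabled (e.g. -mcpu=skylake).
import std.meta : Repeat;

void euclideanDistanceFixedSizeArray(V)(ref Repeat!(3, const(V)) a,
    ref Repeat!(3, const(V)) b, out V result)
    if(is(V : __vector(float[length]), size_t length))
{
    Repeat!(3, V) diffSq = a;
    static foreach(i; 0 .. 3) {
        diffSq[i] -= b[i];
        diffSq[i] *= diffSq[i];
    }
    result = diffSq[0];
    static foreach(i; 1 .. 3) // diffSq[0] is already in result
        result += diffSq[i];

    version(LDC) { version(X86_64) {
        enum isSupportedPlatform = true;
        import ldc.llvmasm : __asm;
        result = __asm!V(`vsqrtps $1, $0`, `=x, x`, result);
    } }
    static assert(isSupportedPlatform);
}

void main()
{
    // 8 points per call: one 256-bit register per coordinate.
    alias V = __vector(float[8]);

    // Structure of arrays: lane k of (ax, ay, az) is point k of set A.
    // A scalar initializer is broadcast to every lane.
    V ax = 3.0f, ay = 4.0f, az = 12.0f;
    V bx = 0.0f, by = 0.0f, bz = 0.0f;

    V dist;
    euclideanDistanceFixedSizeArray!V(ax, ay, az, bx, by, bz, dist);

    // sqrt(3^2 + 4^2 + 12^2) = sqrt(169) = 13 in every lane.
    foreach(d; dist.array)
        assert(d == 13.0f);
}
```

The values are chosen so that a loop over 0 .. 3 would fail the assertion: counting the x term twice gives sqrt(178) instead of 13.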