auto vectorization of interleaves
Bruce Carneal
bcarneal at gmail.com
Sun Jan 9 20:21:41 UTC 2022
With -mcpu=native, -O3, and an EPYC/zen1 target, LDC 1.28
vectorizes the code below when T == ubyte but does not vectorize
that code when T == ushort.
Intra-cache throughput testing on a 2.4GHz zen1 reveals:
30GB/sec -- custom template function in vanilla D (no asm, no
intrinsics)
27GB/sec -- auto vectorized ubyte
6GB/sec -- non vectorized ushort
I'll continue to use that custom code, so no particular urgency
here, but if anyone on the LDC crew can, off the top of their
head, shed some light on this I'd be interested. My guess is
that the cost/benefit function in play here does not take
bandwidth into account at all.
void interleave(T)(T* s0, T* s1, T* s2, T* s3, T[4][] quads)
{
    foreach (i, ref dst; quads[])
    {
        dst[0] = s0[i];
        dst[1] = s1[i];
        dst[2] = s2[i];
        dst[3] = s3[i];
    }
}
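
For anyone who wants to reproduce this, here is a minimal, self-contained sketch that instantiates the template for both element types and checks the result. The checkInterleave helper, the fill pattern, and the element count are assumptions made for illustration; they are not part of the original test setup, and the sketch checks correctness only, not throughput.

```d
import std.stdio : writeln;

// The template from the post, unchanged in behavior.
void interleave(T)(T* s0, T* s1, T* s2, T* s3, T[4][] quads)
{
    foreach (i, ref dst; quads[])
    {
        dst[0] = s0[i];
        dst[1] = s1[i];
        dst[2] = s2[i];
        dst[3] = s3[i];
    }
}

// Hypothetical helper (not from the post): fills four source planes with
// a known pattern, interleaves them, and verifies every output element.
bool checkInterleave(T)(size_t n)
{
    auto s0 = new T[](n);
    auto s1 = new T[](n);
    auto s2 = new T[](n);
    auto s3 = new T[](n);
    foreach (i; 0 .. n)
    {
        // Values wrap around for narrow types; the check below wraps the same way.
        s0[i] = cast(T)(4 * i + 0);
        s1[i] = cast(T)(4 * i + 1);
        s2[i] = cast(T)(4 * i + 2);
        s3[i] = cast(T)(4 * i + 3);
    }

    auto quads = new T[4][](n);
    interleave!T(s0.ptr, s1.ptr, s2.ptr, s3.ptr, quads);

    foreach (i, ref q; quads)
        foreach (k; 0 .. 4)
            if (q[k] != cast(T)(4 * i + k))
                return false;
    return true;
}

void main()
{
    // Both instantiations should produce the same layout; only the
    // generated machine code is expected to differ.
    writeln("ubyte  ok: ", checkInterleave!ubyte(1000));
    writeln("ushort ok: ", checkInterleave!ushort(1000));
}
```

Building this with something like ldc2 -O3 -mcpu=native -output-s and comparing the assembly for the two instantiations should show whether the ushort loop was vectorized on a given target.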