Auto-vectorization of interleaves

Sun Jan 9 20:21:41 UTC 2022

With -mcpu=native, -O3, and an EPYC/zen1 target, LDC 1.28 
vectorizes the code below when T == ubyte but does not vectorize 
it when T == ushort.

Intra-cache throughput testing on a 2.4GHz zen1 shows:
   30GB/sec -- custom template function in vanilla D (no asm, no intrinsics)
   27GB/sec -- auto-vectorized ubyte
    6GB/sec -- non-vectorized ushort

I'll continue to use that custom code, so there is no particular 
urgency here, but if anyone on the LDC crew can, off the top of 
their head, shed some light on this I'd be interested.  My guess 
is that the cost/benefit function in play here does not take 
bandwidth into account at all.

void interleave(T)(T* s0, T* s1, T* s2, T* s3, T[4][] quads)
{
     foreach (i, ref dst; quads[])
     {
         dst[0] = s0[i];
         dst[1] = s1[i];
         dst[2] = s2[i];
         dst[3] = s3[i];
     }
}
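
For anyone who wants to try both instantiations, here is a minimal sketch of a correctness check. It repeats the function above so that it compiles on its own. The helper name roundTrips, the fill pattern, and the element count are hypothetical choices for illustration; this is not the harness behind the throughput numbers above, and it measures nothing.

```d
import std.stdio : writeln;

// Copy of the function from the post, repeated so the sketch is self-contained.
void interleave(T)(T* s0, T* s1, T* s2, T* s3, T[4][] quads)
{
    foreach (i, ref dst; quads[])
    {
        dst[0] = s0[i];
        dst[1] = s1[i];
        dst[2] = s2[i];
        dst[3] = s3[i];
    }
}

// Hypothetical helper: fills four source streams with a known pattern,
// interleaves them, and checks every output element against its source.
bool roundTrips(T)(size_t n)
{
    auto src = new T[][](4, n);          // four streams of n elements each
    foreach (k, stream; src)
        foreach (i, ref v; stream)
            v = cast(T)(i * 4 + k);      // truncating cast is fine for a pattern

    auto quads = new T[4][](n);
    interleave(src[0].ptr, src[1].ptr, src[2].ptr, src[3].ptr, quads);

    foreach (i, q; quads)
        foreach (k; 0 .. 4)
            if (q[k] != src[k][i])
                return false;
    return true;
}

void main()
{
    // Instantiate both element types discussed in the post.
    assert(roundTrips!ubyte(1024));
    assert(roundTrips!ushort(1024));
    writeln("ubyte and ushort interleave checks passed");
}
```

Building this with the same flags as above (-O3 -mcpu=native) and inspecting the generated code for the two instantiations should show whether the ushort loop is left scalar on a given target.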