auto vectorization of interleaves

Johan j at j.nl
Mon Jan 10 00:17:48 UTC 2022


On Sunday, 9 January 2022 at 20:21:41 UTC, Bruce Carneal wrote:
> With -mcpu=native, -O3, and an EPYC/zen1 target, ldc 1.28 
> vectorizes the code below when T == ubyte but does not 
> vectorize that code when T == ushort.
>
> Intra cache throughput testing on a 2.4GhZ zen1 reveals:
>   30GB/sec -- custom template function in vanilla D (no asm, no 
> intrinsics)
>   27GB/sec -- auto vectorized ubyte
>    6GB/sec -- non vectorized ushort
>
> I'll continue to use that custom code, so no particular urgency 
> here, but if anyone of the LDC crew can, off the top of their 
> head, shed some light on this I'd be interested.  My guess is 
> that the cost/benefit function in play here does not take 
> bandwidth into account at all.
>
>
> void interleave(T)(T* s0, T* s1, T* s2, T* s3, T[4][] quads)
> {
>     foreach (i, ref dst; quads[])
>     {
>         dst[0] = s0[i];
>         dst[1] = s1[i];
>         dst[2] = s2[i];
>         dst[3] = s3[i];
>     }
> }

Hi Bruce,
   This could be due to a number of things. Most likely it is the 
possibility of pointer aliasing; it could also be alignment assumptions.

Your message is unclear, though: what is the difference between the 
"custom template" and the other two versions? It would be clearest 
to provide the code of all three, without templates. (If it is a 
template, then cross-module inlining is possible; is that what is 
causing the speed boost?)

You can look at the LLVM IR output (--output-ll) to better 
understand what is, or is not, happening inside the optimizer, and why.
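A small illustration of that workflow (the ldc2 invocation is shown only as a comment, since it assumes ldc2 on your PATH and a source file named interleave.d; the one line actually executed here just runs the grep against a sample IR line):

```shell
# Emit LLVM IR next to the object file:
#   ldc2 -O3 -mcpu=native --output-ll interleave.d
# Then check whether the loop was vectorized by searching the .ll
# for vector types, e.g. 16-lane ushort vectors:
#   grep -E '<[0-9]+ x i16>' interleave.ll
# Illustration of that grep on a sample vectorized-IR line:
printf '%s\n' '  %v = load <16 x i16>, <16 x i16>* %p' | grep -oE '<16 x i16>' | head -n1
```

If the grep finds no vector types for the ushort instantiation, the optimizer's remarks (e.g. -pass-remarks-missed output from LLVM) usually say which check the vectorizer failed.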

-Johan



More information about the digitalmars-d-ldc mailing list