auto vectorization of interleaves
Johan
j at j.nl
Mon Jan 10 00:17:48 UTC 2022
On Sunday, 9 January 2022 at 20:21:41 UTC, Bruce Carneal wrote:
> With -mcpu=native, -O3, and an EPYC/zen1 target, ldc 1.28
> vectorizes the code below when T == ubyte but does not
> vectorize that code when T == ushort.
>
> Intra cache throughput testing on a 2.4GHz zen1 reveals:
> 30GB/sec -- custom template function in vanilla D (no asm, no
> intrinsics)
> 27GB/sec -- auto vectorized ubyte
> 6GB/sec -- non vectorized ushort
>
> I'll continue to use that custom code, so no particular urgency
> here, but if anyone of the LDC crew can, off the top of their
> head, shed some light on this I'd be interested. My guess is
> that the cost/benefit function in play here does not take
> bandwidth into account at all.
>
>
> void interleave(T)(T* s0, T* s1, T* s2, T* s3, T[4][] quads)
> {
>     foreach (i, ref dst; quads[])
>     {
>         dst[0] = s0[i];
>         dst[1] = s1[i];
>         dst[2] = s2[i];
>         dst[3] = s3[i];
>     }
> }
Hi Bruce,
This could be due to a number of things. Most likely it's the
possibility of pointer aliasing; it could also be alignment
assumptions. Your message is unclear though: what's the difference
between the "custom template" function and the other two versions?
It would be clearest to provide the code of all three, without
templates. (If it is a template, then cross-module inlining becomes
possible; is that what's causing the speed boost?)
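For example, if aliasing turns out to be the blocker, LDC's
`@restrict` attribute (from `ldc.attributes`, which maps to LLVM's
`noalias`) promises the optimizer that the pointer parameters don't
overlap. An untested sketch of your function with that annotation:

```d
import ldc.attributes : restrict;

// @restrict asserts to the optimizer that s0..s3 do not alias
// the destination (or each other), removing one common reason
// for the vectorizer to bail out.
void interleave(T)(@restrict T* s0, @restrict T* s1,
                   @restrict T* s2, @restrict T* s3, T[4][] quads)
{
    foreach (i, ref dst; quads[])
    {
        dst[0] = s0[i];
        dst[1] = s1[i];
        dst[2] = s2[i];
        dst[3] = s3[i];
    }
}
```

Whether that actually flips the ushort case to vectorized code is
something you'd have to verify in the IR.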
You can look at the LLVM IR output (--output-ll) to understand
better why/what is (not) happening inside the optimizer.
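Concretely, something along these lines (the source file name is a
placeholder; the flags match your experiment):

```shell
# Emit the optimized LLVM IR alongside compilation.
ldc2 -O3 -mcpu=native --output-ll -c interleave.d

# A vectorized loop shows up as vector types in the .ll file,
# e.g. <16 x i8> for the ubyte case or <8 x i16> for ushort.
grep -E '<[0-9]+ x i(8|16)>' interleave.ll
```

If the ushort version shows only scalar `i16` loads/stores, the
vectorizer gave up, and the IR around the loop usually hints at why.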
-Johan
More information about the digitalmars-d-ldc mailing list