auto vectorization of interleaves
Bruce Carneal
bcarneal at gmail.com
Mon Jan 10 03:04:22 UTC 2022
On Monday, 10 January 2022 at 00:17:48 UTC, Johan wrote:
> On Sunday, 9 January 2022 at 20:21:41 UTC, Bruce Carneal wrote:
>> With -mcpu=native, -O3, and an EPYC/zen1 target, ldc 1.28
>> vectorizes the code below when T == ubyte but does not
>> vectorize that code when T == ushort.
>>
>> Intra-cache throughput testing on a 2.4GHz zen1 reveals:
>>   30GB/sec -- custom template function in vanilla D (no asm, no intrinsics)
>>   27GB/sec -- auto-vectorized ubyte
>>    6GB/sec -- non-vectorized ushort
>>
>> I'll continue to use that custom code, so no particular
>> urgency here, but if anyone of the LDC crew can, off the top
>> of their head, shed some light on this I'd be interested. My
>> guess is that the cost/benefit function in play here does not
>> take bandwidth into account at all.
>>
>>
>> void interleave(T)(T* s0, T* s1, T* s2, T* s3, T[4][] quads)
>> {
>>     foreach (i, ref dst; quads[])
>>     {
>>         dst[0] = s0[i];
>>         dst[1] = s1[i];
>>         dst[2] = s2[i];
>>         dst[3] = s3[i];
>>     }
>> }
>
> Hi Bruce,
> This could be due to a number of things. Probably it's due to
> pointer aliasing possibility. Could also be alignment
> assumptions.
>
I don't think it's pointer aliasing, since the 10-line template
function seen above was used for both the ubyte and ushort
instantiations. The ubyte instantiation auto-vectorized nicely;
the ushort instantiation did not.
Also, I don't think the unaligned vector load/store instructions
have alignment restrictions. They are generated by LDC when you
have something like:

ushort[8]* sap = ...;
auto tmp = cast(__vector(ushort[8])) sap[0]; // turns into: vmovups ...
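A self-contained illustration of that cast-load pattern (the buffer
contents and names here are illustrative, not from the thread) is:

```d
import core.simd;

void main()
{
    // Illustrative buffer; the point is only that the load below
    // carries no alignment requirement.
    ushort[8] buf = [1, 2, 3, 4, 5, 6, 7, 8];
    ushort[8]* sap = &buf;

    // On x86-64 LDC lowers this cast-load to an unaligned vmovups.
    auto tmp = cast(__vector(ushort[8])) sap[0];
    assert(tmp.array[0] == 1 && tmp.array[7] == 8);
}
```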
> Your message is unclear though: what's the difference between
> the "custom template" and the other two? Most clear to just
> provide the code of all three, without templates. (If it is a
> template, then cross-module inlining is possible, is that
> causing a speed boost?)
The 10-liner above was responsible for the 27GB/sec ubyte
performance and the 6GB/sec ushort performance. The "custom
template", not shown, is a 35-LOC template function that I wrote
to accomplish, at speed, what the 10-LOC template could not.
Note that the LDC auto-vectorized ubyte instantiation of that
simple 10-liner is very good. I'm trying to understand why LDC
does such a great job with T == ubyte, yet fails to vectorize at
all when T == ushort.
>
> You can look at the LLVM IR output (--output-ll) to understand
> better why/what is (not) happening inside the optimizer.
>
> -Johan
I took a look at the IR but didn't see an explanation. I may
have missed something there... the output is a little noisy.
On a lark I tried the clang flags that enable vectorization
diagnostics. Unsurprisingly :-) those did not work.
If/when there is a clamoring for better auto-vectorization,
enabling clang style analysis/diagnostics might be cost effective.
In the meantime, the workarounds D affords are not bad at all,
and they will likely be preferred in known hot paths for their
predictability anyway.
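For the ushort case specifically, one workaround in that spirit (a
hedged sketch only, NOT the 35-LOC custom template from this thread,
which was never posted) is to pack each ushort[4] group into a single
64-bit store, assuming a little-endian target such as x86-64:

```d
// Hypothetical sketch: pack four 16-bit lanes into one 64-bit value
// so each quad becomes a single 8-byte write, sidestepping the
// auto-vectorizer entirely. Assumes little-endian (e.g. x86-64),
// where unaligned 64-bit stores are also cheap.
void interleavePacked(ushort* s0, ushort* s1, ushort* s2, ushort* s3,
                      ushort[4][] quads)
{
    foreach (i, ref dst; quads)
    {
        ulong packed = cast(ulong) s0[i]
                     | (cast(ulong) s1[i] << 16)
                     | (cast(ulong) s2[i] << 32)
                     | (cast(ulong) s3[i] << 48);
        *cast(ulong*) dst.ptr = packed;  // one store per quad
    }
}
```

Whether this matches the 30GB/sec custom template is untested here;
it merely shows the kind of predictable hot-path code D makes easy.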
Thanks for the response and thanks again for your very useful
contributions to my everyday compiler.
More information about the digitalmars-d-ldc mailing list