auto vectorization of interleaves
Johan
j at j.nl
Mon Jan 10 19:21:06 UTC 2022
On Monday, 10 January 2022 at 03:04:22 UTC, Bruce Carneal wrote:
> On Monday, 10 January 2022 at 00:17:48 UTC, Johan wrote:
>> On Sunday, 9 January 2022 at 20:21:41 UTC, Bruce Carneal wrote:
>>> With -mcpu=native, -O3, and an EPYC/zen1 target, ldc 1.28
>>> vectorizes the code below when T == ubyte but does not
>>> vectorize that code when T == ushort.
>>>
>>> Intra-cache throughput testing on a 2.4GHz zen1 reveals:
>>> 30GB/sec -- custom template function in vanilla D (no asm,
>>> no intrinsics)
>>> 27GB/sec -- auto vectorized ubyte
>>> 6GB/sec -- non vectorized ushort
>>>
>>> I'll continue to use that custom code, so no particular
>>> urgency here, but if anyone of the LDC crew can, off the top
>>> of their head, shed some light on this I'd be interested. My
>>> guess is that the cost/benefit function in play here does not
>>> take bandwidth into account at all.
>>>
>>>
>>> void interleave(T)(T* s0, T* s1, T* s2, T* s3, T[4][] quads)
>>> {
>>>     foreach (i, ref dst; quads[])
>>>     {
>>>         dst[0] = s0[i];
>>>         dst[1] = s1[i];
>>>         dst[2] = s2[i];
>>>         dst[3] = s3[i];
>>>     }
>>> }
>>
>> Hi Bruce,
>> This could be due to a number of things. Most likely it's the
>> possibility of pointer aliasing. It could also be alignment
>> assumptions.
>>
>
> I don't think it's pointer aliasing since the 10 line template
> function seen above was used for both ubyte and ushort
> instantiations. The ubyte instantiation auto vectorized
> nicely. The ushort instantiation did not.
>
> Also, I don't think the unaligned vector load/store instructions
> have alignment restrictions. They are generated by LDC when you
> have something like:
>     ushort[8]* sap = ...
>     auto tmp = cast(__vector(ushort[8]))sap[0]; // turns into:
>     vmovups ...

The compiler complains about aliasing when optimizing:
https://d.godbolt.org/z/hnGj3G3zo

For example, the write to `dst[0]` may alias with `s1[i]`, so
`s1[i]` needs to be reloaded after that store. I think the problem
gets worse with 16-bit numbers because they may partially overlap
(8 bits of `dst[0]` overlapping with `s1[i]`)? That is just a guess
at why the lookup tables `.LCPI0_x` are generated...
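
To make the aliasing concern concrete, here is a minimal sketch in plain D. It is not taken from the godbolt link: the helper names `runDisjoint` and `runOverlapping` and the test values are made up for illustration, and the overlapping call is a deliberately contrived one that real callers would presumably never make. It only shows that the language semantics require the store to `dst[0]` to be visible to a later load through `s1`, which is the case the optimizer has to allow for.

```d
import std.stdio : writeln;

// Same loop as the quoted template.
void interleave(T)(T* s0, T* s1, T* s2, T* s3, T[4][] quads)
{
    foreach (i, ref dst; quads[])
    {
        dst[0] = s0[i];
        dst[1] = s1[i];
        dst[2] = s2[i];
        dst[3] = s3[i];
    }
}

// Sources and destination are disjoint: the ordinary interleave.
ushort[4][2] runDisjoint()
{
    ushort[2] a = [1, 2];
    ushort[2] b = [10, 20];
    ushort[2] c = [100, 200];
    ushort[2] d = [1000, 2000];
    ushort[4][2] dst;
    interleave(a.ptr, b.ptr, c.ptr, d.ptr, dst[]);
    return dst;
}

// Contrived overlap (illustration only): s1 points at the start of the
// destination buffer, so s1[0] is the same memory as dst[0] of quad 0.
ushort[4][2] runOverlapping()
{
    ushort[2] a = [1, 2];
    ushort[2] c = [100, 200];
    ushort[2] d = [1000, 2000];
    ushort[4][2] buf = [[7, 8, 0, 0], [0, 0, 0, 0]];
    ushort* s1 = &buf[0][0];
    interleave(a.ptr, s1, c.ptr, d.ptr, buf[]);
    return buf;
}

void main()
{
    ushort[4][2] expectedDisjoint = [[1, 10, 100, 1000], [2, 20, 200, 2000]];
    assert(runDisjoint() == expectedDisjoint);

    // dst[1] of quad 0 picks up the 1 that was just stored to dst[0],
    // not the 7 that was in the buffer before the call.
    ushort[4][2] expectedOverlap = [[1, 1, 100, 1000], [2, 1, 200, 2000]];
    assert(runOverlapping() == expectedOverlap);

    writeln("disjoint:    ", runDisjoint());
    writeln("overlapping: ", runOverlapping());
}
```

A vectorized loop that loaded all four sources up front would put the stale 7 and 8 into the overlapping result, so the compiler either has to prove that the pointers do not overlap, emit a runtime overlap check, or keep the loads and stores in program order.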
-Johan