auto vectorization of interleaves

Johan j at j.nl
Mon Jan 10 19:21:06 UTC 2022


On Monday, 10 January 2022 at 03:04:22 UTC, Bruce Carneal wrote:
> On Monday, 10 January 2022 at 00:17:48 UTC, Johan wrote:
>> On Sunday, 9 January 2022 at 20:21:41 UTC, Bruce Carneal wrote:
>>> With -mcpu=native, -O3, and an EPYC/zen1 target, ldc 1.28 
>>> vectorizes the code below when T == ubyte but does not 
>>> vectorize that code when T == ushort.
>>>
>>> Intra-cache throughput testing on a 2.4 GHz zen1 reveals:
>>>   30GB/sec -- custom template function in vanilla D (no asm, 
>>> no intrinsics)
>>>   27GB/sec -- auto vectorized ubyte
>>>    6GB/sec -- non vectorized ushort
>>>
>>> I'll continue to use that custom code, so no particular 
>>> urgency here, but if anyone of the LDC crew can, off the top 
>>> of their head, shed some light on this I'd be interested.  My 
>>> guess is that the cost/benefit function in play here does not 
>>> take bandwidth into account at all.
>>>
>>>
>>> void interleave(T)(T* s0, T* s1, T* s2, T* s3, T[4][] quads)
>>> {
>>>     foreach (i, ref dst; quads[])
>>>     {
>>>         dst[0] = s0[i];
>>>         dst[1] = s1[i];
>>>         dst[2] = s2[i];
>>>         dst[3] = s3[i];
>>>     }
>>> }
>>
>> Hi Bruce,
>>   This could be due to a number of things. Probably it's due 
>> to pointer aliasing possibility. Could also be alignment 
>> assumptions.
>>
>
> I don't think it's pointer aliasing since the 10 line template 
> function seen above was used for both ubyte and ushort 
> instantiations.  The ubyte instantiation auto vectorized 
> nicely.  The ushort instantiation did not.
>
> Also, I don't think the unaligned vector load/store instructions 
> have alignment restrictions. They are generated by LDC when you 
> have something like:
>   ushort[8]* sap = ...
>   auto tmp = cast(__vector(ushort[8]))sap[0]; // turns into: vmovups ...

The compiler complains about aliasing when optimizing.
https://d.godbolt.org/z/hnGj3G3zo

For example, the write to `dst[0]` may alias with `s1[i]`, so 
`s1[i]` needs to be reloaded. I think the problem gets worse with 
16-bit numbers because they may partially overlap (e.g. 8 bits of 
`dst[0]` overlapping with `s1[i]`). That is just a guess at why the 
lookup tables `.LCPI0_x` are generated...
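
To make the aliasing concern concrete, here is a minimal sketch in plain D. It deliberately passes an `s1` that points into `quads`, so the store to `dst[0]` changes what `s1[i]` reads next. `interleaveLoadFirst` is a hypothetical variant, not something from the code above: it reads all four inputs before storing, which gives the same result only when the caller guarantees that nothing overlaps. Whether it changes LDC's vectorization decision is untested here.

```d
// The template from the original post, unchanged.
void interleave(T)(T* s0, T* s1, T* s2, T* s3, T[4][] quads)
{
    foreach (i, ref dst; quads[])
    {
        dst[0] = s0[i];
        dst[1] = s1[i];
        dst[2] = s2[i];
        dst[3] = s3[i];
    }
}

// Hypothetical variant (an assumption, not from the thread): read every
// input of an iteration before writing any output of that iteration.
// Equivalent to interleave() only if sources and quads do not overlap.
void interleaveLoadFirst(T)(T* s0, T* s1, T* s2, T* s3, T[4][] quads)
{
    foreach (i, ref dst; quads[])
    {
        const T a = s0[i];
        const T b = s1[i];
        const T c = s2[i];
        const T d = s3[i];
        dst[0] = a;
        dst[1] = b;
        dst[2] = c;
        dst[3] = d;
    }
}

void main()
{
    ushort[1] a = [1], c = [3], d = [4];

    // s1 points at quads[0][0]: the store to dst[0] overwrites s1[0]
    // before it is read, so the compiler must reload it.
    ushort[4][1] q1;
    q1[0][0] = 2;
    interleave(a.ptr, &q1[0][0], c.ptr, d.ptr, q1[]);
    ushort[4] storeThenLoad = [1, 1, 3, 4];
    assert(q1[0] == storeThenLoad);

    // Same overlapping setup, but the loads happen first, so the
    // original value 2 is what ends up in dst[1].
    ushort[4][1] q2;
    q2[0][0] = 2;
    interleaveLoadFirst(a.ptr, &q2[0][0], c.ptr, d.ptr, q2[]);
    ushort[4] loadThenStore = [1, 2, 3, 4];
    assert(q2[0] == loadThenStore);
}
```

The two results differ only because of the overlap, which is exactly the case the optimizer has to allow for when it cannot prove the pointers are disjoint.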

-Johan



