LDC/GDC auto vectorization observations
Bruce Carneal
bcarneal at gmail.com
Mon Aug 8 15:49:41 UTC 2022
For much of the code that I throw at it, LDC auto vectorization
is comparable to GDC (both very good) but LDC lags in a few
situations. The link below shows one of those, a gather of the
two primaries from a simple Bayer mosaic pixel:
https://godbolt.org/z/sfd8e4hqe
The throughput difference on my (aging) 2.4 GHz Zen 1 is
21+ GB/sec/core for GDC vs ~9 GB/sec/core for LDC. The expected
(hoped for) auto vec loop starts at .L6 in the GDC output. (I
realize there are other ways to "skin" the Bayer pixel unpacking
cat; this is just an example of where one approach breaks down.)
Additionally, I'll note that GDC can auto vectorize in the face
of multiple outputs (low degree kernel fusion). I've not seen
LDC do that in the few cases that I've tried.
Both LDC and GDC auto vectorization are sufficiently advanced
that I've been able to eliminate a good deal of __vector code and
the attendant difficulties in target specialization, source code
expansion, and testing. D's __vector capabilities help out with
the remainder.
Thanks again for providing a very useful tool.
More information about the digitalmars-d-ldc
mailing list