Performance issue with @fastmath and vectorization
deXtoRious via digitalmars-d-ldc
digitalmars-d-ldc at puremagic.com
Sat Nov 12 07:44:28 PST 2016
Okay, so I've done some further experimentation with rather
peculiar results. On the bright side, I'm now fairly sure this
isn't an outright bug in the compiler. On the flip side, however,
I'm quite confused by what I'm seeing.
For the record, here are the current versions of the benchmark in
godbolt:
D: https://godbolt.org/g/B8gosP
C++: https://godbolt.org/g/DWjQrV
Apparently, LDC can be coaxed to use FMA instructions after all.
It seems that with __attribute__((__weak__)) Clang produces code
that is essentially identical to the D binary; both versions run
in about 19ms on my machine. When I remove
__attribute__((__weak__)) and make the compute_neq function
static void rather than simply void, Clang further unrolls the
inner loop and uses a number of optimized load/store instructions
that improve performance by a huge margin, down to about 7ms. As
for LDC, adding or removing @weak and static likewise has a major
impact on the generated code and therefore on the performance.
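To make the toggle concrete, here is a minimal D sketch of the
kind of kernel in question. The body is just a placeholder
elementwise multiply-accumulate (the actual benchmark is in the
godbolt links above); the point is where the attributes from
ldc.attributes go:

    import ldc.attributes : fastmath, weak;

    // Placeholder for the real benchmark kernel from the godbolt link.
    // Toggle @weak on or off here to see the codegen change; @fastmath
    // is what allows the multiply-add to contract into an FMA.
    @fastmath // @weak
    void compute_neq(double[] result, const double[] a, const double[] b)
    {
        foreach (i; 0 .. result.length)
            result[i] += a[i] * b[i]; // candidate for vectorized FMA
    }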
I have not found any way to make LDC perform the same
optimizations as Clang's best case (simply static void, no weak
attribute) and have run out of ideas. Furthermore, I have no idea
why the aforementioned changes to the function declaration affect
both optimizers in this way, or whether finer control over
vectorization/loop unrolling is possible in LDC. Any thoughts?
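One fallback, if finer-grained control turns out not to exist,
would be to unroll by hand, along the lines of the (again
hypothetical) sketch below, reusing the placeholder signature
from above:

    import ldc.attributes : fastmath;

    // Hand-unrolled variant of the placeholder kernel, processing four
    // elements per iteration plus a scalar tail. This only illustrates
    // the transformation Clang performs automatically in its best case.
    @fastmath
    void compute_neq_unrolled(double[] result, const double[] a,
                              const double[] b)
    {
        size_t i = 0;
        const n = result.length;
        for (; i + 4 <= n; i += 4)
        {
            result[i]     += a[i]     * b[i];
            result[i + 1] += a[i + 1] * b[i + 1];
            result[i + 2] += a[i + 2] * b[i + 2];
            result[i + 3] += a[i + 3] * b[i + 3];
        }
        for (; i < n; ++i) // leftover elements
            result[i] += a[i] * b[i];
    }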