Performance issue with @fastmath and vectorization
deXtoRious via digitalmars-d-ldc
digitalmars-d-ldc at puremagic.com
Sat Nov 12 07:44:28 PST 2016
Okay, so I've done some further experimentation with rather
peculiar results. On the bright side, I'm now fairly sure this
isn't an outright bug in the compiler. On the flip side, however,
I'm quite confused by what I'm seeing.
For the record, here are the current versions of the benchmark in
godbolt:
D: https://godbolt.org/g/B8gosP
C++: https://godbolt.org/g/DWjQrV
Apparently, LDC can be coaxed to use FMA instructions after all.
It seems that with __attribute__((__weak__)) Clang produces code
that is essentially identical to the D binary; both versions run
in about 19ms on my machine. When I remove
__attribute__((__weak__)) and make the compute_neq function
static void rather than simply void, Clang further unrolls the
inner loop and uses a number of optimized load/store instructions
that improve performance by a huge margin, down to about 7ms. As
for LDC, adding or removing @weak and static likewise has a major
impact on the generated code and therefore on the performance.
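To make the toggle concrete, here is a minimal D sketch of the
kind of kernel in question. The body is just a placeholder
elementwise multiply-accumulate (the actual benchmark is in the
godbolt links above); the point is where the attributes from
ldc.attributes go:

    import ldc.attributes : fastmath, weak;

    // Placeholder for the real benchmark kernel from the godbolt link.
    // Toggle @weak on or off here to see the codegen change; @fastmath
    // is what allows the multiply-add to contract into an FMA.
    @fastmath // @weak
    void compute_neq(double[] result, const double[] a, const double[] b)
    {
        foreach (i; 0 .. result.length)
            result[i] += a[i] * b[i]; // candidate for vectorized FMA
    }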
I have not found any way to make LDC perform the same
optimizations as Clang's best case (simply static void, no weak
attribute) and have run out of ideas. Furthermore, I have no idea
why the aforementioned changes to the function declaration affect
both optimizers in this way, or whether finer control over
vectorization/loop unrolling is possible in LDC. Any thoughts?
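One fallback, if finer-grained control turns out not to exist,
would be to unroll by hand, along the lines of the (again
hypothetical) sketch below, reusing the placeholder signature
from above:

    import ldc.attributes : fastmath;

    // Hand-unrolled variant of the placeholder kernel, processing four
    // elements per iteration plus a scalar tail. This only illustrates
    // the transformation Clang performs automatically in its best case.
    @fastmath
    void compute_neq_unrolled(double[] result, const double[] a,
                              const double[] b)
    {
        size_t i = 0;
        const n = result.length;
        for (; i + 4 <= n; i += 4)
        {
            result[i]     += a[i]     * b[i];
            result[i + 1] += a[i + 1] * b[i + 1];
            result[i + 2] += a[i + 2] * b[i + 2];
            result[i + 3] += a[i + 3] * b[i + 3];
        }
        for (; i < n; ++i) // leftover elements
            result[i] += a[i] * b[i];
    }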