Performance issue with @fastmath and vectorization

Sat Nov 12 02:27:53 PST 2016

On Saturday, 12 November 2016 at 09:45:29 UTC, LiNbO3 wrote:
> On Saturday, 12 November 2016 at 00:03:16 UTC, dextorious wrote:
>> On my Haswell i7-4710HQ machine the C++ version runs in 
>> ~10ms/iteration while the D code takes 25ms. Comparing 
>> profiler output with the generated assembly code quickly 
>> reveals the reason - while Clang fully unrolls the inner loop 
>> and uses FMA instructions wherever possible, the inner loop 
>> assembly produced by LDC looks like this:
>
> By compiling your code with the same set of flags you used on 
> the godbolt (https://d.godbolt.org/) service I do see the FMA 
> instructions being used.

There are three vfmadd231ss instructions in the entire assembly, 
but none of them are in the inner loop. The presence of any FMA 
instructions at all does show that the compiler honors the -mcpu 
switch, but it doesn't seem to recognize the multiply-add fusion 
opportunities in the inner loop. The assembly generated by the 
godbolt service seems largely identical to what I got on my local 
machine.
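
To make concrete what an FMA opportunity in an inner loop looks 
like, here is a minimal C++ sketch. It is a hypothetical kernel 
(the names dot_plain and dot_fma are made up for illustration), 
not the code from this thread. The first loop is a multiply 
followed by an add, which a compiler may fuse into a single 
vfmadd231ss when the target supports FMA and floating-point 
contraction is allowed. The second requests the fused operation 
explicitly through std::fma.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical kernel, not the code discussed in this thread.
// Each iteration is a multiply followed by an add; whether the
// compiler fuses the pair into one FMA instruction depends on the
// target CPU flags and the floating-point contraction settings.
float dot_plain(const float* a, const float* b, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        acc += a[i] * b[i];
    return acc;
}

// Same loop with the fused multiply-add requested explicitly.
// std::fma computes a[i] * b[i] + acc with a single rounding.
float dot_fma(const float* a, const float* b, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        acc = std::fma(a[i], b[i], acc);
    return acc;
}
```

Comparing the assembly of the two functions for the same target 
is one way to check whether the compiler picked up the fusion on 
its own. Because a fused operation rounds once instead of twice, 
the two versions can differ in the last bits for general inputs.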