Performance issue with @fastmath and vectorization

Sat Nov 12 01:45:29 PST 2016

On Saturday, 12 November 2016 at 00:03:16 UTC, dextorious wrote:
> On my Haswell i7-4710HQ machine the C++ version runs in 
> ~10ms/iteration while the D code takes 25ms. Comparing profiler 
> output with the generated assembly code quickly reveals the 
> reason - while Clang fully unrolls the inner loop and uses FMA 
> instructions wherever possible, the inner loop assembly 
> produced by LDC looks like this:

Compiling your code with the same set of flags you used on the 
godbolt service (https://d.godbolt.org/), I do see FMA 
instructions being used.