Performance issue with @fastmath and vectorization

Sat Nov 12 02:56:20 PST 2016

On Saturday, 12 November 2016 at 10:47:42 UTC, Johan Engelen 
wrote:
> On Saturday, 12 November 2016 at 10:27:53 UTC, deXtoRious wrote:
>>
>> There are three vfmadd231ss in the entire assembly, but none 
>> of them are in the inner loop. The presence of any FMA 
>> instructions at all does show that the compiler properly 
>> accepts the -mcpu switch, but it doesn't seem to recognize the 
>> opportunities present in the inner loop.
>
> Does the C++ need `__restrict__` for the parameters to get the 
> assembly you want?

In this case, it doesn't seem to make any difference. I 
habitually use __restrict__ wherever possible in HPC code, but 
Clang and GCC are often smart enough nowadays to make the 
inference regardless.
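
For anyone unfamiliar with it, here is a minimal sketch of the kind of 
annotation I mean. The function name `axpy` and its signature are made up 
for illustration; this is not the kernel from this thread. 
`__restrict__` is a GCC/Clang extension, not standard C++.

```cpp
#include <cstddef>

// Hypothetical example kernel (not the code discussed in this thread).
// __restrict__ promises the compiler that x and y never overlap, which
// may allow it to vectorize the loop without emitting runtime overlap checks.
void axpy(float a,
          const float* __restrict__ x,
          float* __restrict__ y,
          std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];  // candidate for FMA contraction
    }
}
```

Whether the qualifier changes the generated code depends on the compiler 
and the loop; as noted above, it made no visible difference in this case.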

On that note, I was under the impression that D arrays carried 
a no-aliasing assumption. If that's not the case, is there a 
way to achieve the equivalent of __restrict__ in D?

>
>> The assembly generated by the godbolt service seems largely 
>> identical to the one I got on my local machine.
>
> It is easier for the discussion if you paste godbolt.org links 
> btw, so we don't have to manually do it ourselves ;-)
>
> -Johan

Will do. :)

By the way, I posted that issue on GitHub: 
https://github.com/ldc-developers/ldc/issues/1874