Performance issue with @fastmath and vectorization

Sat Nov 12 03:16:16 PST 2016

On Saturday, 12 November 2016 at 11:04:59 UTC, Johan Engelen 
wrote:
> On Saturday, 12 November 2016 at 10:56:20 UTC, deXtoRious wrote:
>> On Saturday, 12 November 2016 at 10:47:42 UTC, Johan Engelen 
>> wrote:
>>>
>>> Does the C++ need `__restrict__` for the parameters to get 
>>> the assembly you want?
>>
>> In this case, it doesn't seem to make any difference.
>
> That's good news, because there is currently no way to add that 
> to LDC code, afaik.

I hope it's somewhere on the roadmap, as `__restrict__` does 
still make a measurable difference in some cases.

>
> Hope you can try to cut more of these things from the example 
> so it's easier to figure out why things are different.  (e.g. 
> is -Ofast needed, or is -O3 enough?)
>
> Thanks!
>
> cheers,
>   Johan

-Ofast is also there out of habit; it doesn't make a meaningful 
difference for a benchmark as simple as this. Other switches, 
like -fno-rtti, -fno-exceptions and even -flto, can also be 
dropped. Simply using -O3 -march=native -ffast-math is sufficient 
to outperform LDC by 2.5x. That loses only about 10% from the best 
C++ performance and produces essentially the same unrolled, 
FMA-enabled assembly with very minor changes.
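
For anyone who wants to check the `__restrict__` question on their own compiler, here is a minimal sketch. It is not the benchmark from this thread: the kernel and the names `axpy_restrict` and `axpy_plain` are made up for illustration, and `__restrict__` is assumed to be available as the GCC/Clang extension.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical kernel, NOT the benchmark discussed in this thread.
// __restrict__ (a GCC/Clang extension) promises the compiler that
// out, a and b never alias each other.
void axpy_restrict(float* __restrict__ out,
                   const float* __restrict__ a,
                   const float* __restrict__ b,
                   float k, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] * k + b[i];  // multiply-add, a candidate for FMA
}

// The same loop without the qualifier, so the two versions can be
// compared side by side in the generated assembly.
void axpy_plain(float* out, const float* a, const float* b,
                float k, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] * k + b[i];
}
```

Compiling this with something like `g++ -O3 -march=native -ffast-math -S` and diffing the two function bodies should show whether the qualifier changes anything for a given loop. Whether it does will depend on the compiler version and the target CPU.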