Performance issue with @fastmath and vectorization

Sat Nov 12 04:40:19 PST 2016

On Saturday, 12 November 2016 at 12:11:35 UTC, Johan Engelen 
wrote:
> On Saturday, 12 November 2016 at 11:16:16 UTC, deXtoRious wrote:
>> On Saturday, 12 November 2016 at 11:04:59 UTC, Johan Engelen 
>> wrote:
>>> On Saturday, 12 November 2016 at 10:56:20 UTC, deXtoRious 
>>> wrote:
>>>> On Saturday, 12 November 2016 at 10:47:42 UTC, Johan Engelen 
>>>> wrote:
>>>>>
>>>>> Does the C++ need `__restrict__` for the parameters to get 
>>>>> the assembly you want?
>>>>
>>>> In this case, it doesn't seem to make any difference.
>>>
>>> That's good news, because there is currently no way to add 
>>> that to LDC code, afaik.
>>
>> I hope it's somewhere on the roadmap for the future, as it 
>> does still make a measurable difference in some cases.
>
> Can you file an issue for that too? (ideas in forum posts get 
> lost instantly)
> Make sure you add a testcase (as small as possible) that shows 
> a clear difference in codegen for C++ with and without 
> `__restrict__`, and worse codegen for the equivalent D code, 
> which lacks it.
> It may be relatively easy to implement it in LDC, but I don't 
> think many people know the intricacies of C's restrict. 
> Examples of the effect it has on assembly (clang C++) help a 
> lot towards getting it implemented.
>
>> -Ofast is also there out of habit; it doesn't make a meaningful 
>> difference for a benchmark as simple as this. Other switches, 
>> like -fno-rtti, -fno-exceptions and even -flto, can also be 
>> dropped. Simply using -O3 -march=native -ffast-math is 
>> sufficient to outperform LDC by 2.5x, losing only about 10% 
>> from the best C++ performance and producing essentially the 
>> same unrolled FMA-enabled assembly with very minor changes.
>
> OK great.
> I think you ran into a compiler limitation somehow, so make 
> sure you submit the simplified example/testcase on GH ! ;)
> (the simpler you can make it, the better)
>
> Btw, for benchmarking, you should mark the `compute_neq` 
> function as "weak linkage", such that the compiler is not going 
> to do inter-procedural optimization for the call to 
> `compute_neq` in `main`. (@weak for LDC, clang probably 
> something like __attribute__((weak)))

Okay, I'll clean up the code and post an issue on GH later today. 
Hopefully someone can figure out where the discrepancy comes from.
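
On the weak-linkage suggestion quoted above, a minimal C++ sketch of what that could look like with GCC or clang follows. The kernel `sum_squares` is a hypothetical stand-in, not the actual `compute_neq` from the benchmark. `__attribute__((weak))` is a GCC/clang extension, and the assumption is that the compiler treats a weak definition as replaceable at link time and so does not inline it or specialize it for the call site.

```cpp
#include <cstddef>

// Hypothetical stand-in for the benchmarked kernel (not the real compute_neq).
// Marking the definition weak tells GCC/clang that another definition may
// replace it at link time, so the optimizer should not inline it into the
// caller or constant-fold the call away in the benchmark's main().
__attribute__((weak))
float sum_squares(const float* data, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        acc += data[i] * data[i];
    }
    return acc;
}
```

Calling `sum_squares` from a timing loop in `main` should then measure the function as compiled on its own, which is what makes the assembly comparable between compilers.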

I'll also file a separate issue / feature request for restrict 
afterwards, once I write up a representative test case that 
highlights the performance impact.
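
A rough idea of the shape such a test case could take on the C++ side is sketched below. The function names `scale_add_plain` and `scale_add_restrict` are made up for illustration, and `__restrict__` is a GCC/clang extension rather than standard C++. Whether the two versions actually produce different assembly depends on the compiler and flags, so this is only a starting point, not a demonstrated difference.

```cpp
#include <cstddef>

// Without restrict: the compiler has to assume `out` may overlap `a` or `b`,
// so it may emit runtime overlap checks or a more conservative loop.
void scale_add_plain(float* out, const float* a, const float* b,
                     std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] * 2.0f + b[i];
    }
}

// With restrict: the caller promises that the three arrays do not overlap,
// which gives the vectorizer more freedom. Passing overlapping arrays here
// is undefined behaviour.
void scale_add_restrict(float* __restrict__ out,
                        const float* __restrict__ a,
                        const float* __restrict__ b,
                        std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] * 2.0f + b[i];
    }
}
```

Both functions compute the same result for non-overlapping inputs; the interesting part for the issue would be a side-by-side diff of the generated assembly at -O3.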

Thanks for your help! The ability to get quick responses on 
compiler issues like this is really encouraging me to write more 
high-performance code in D.