iasm, unexpectedly slower than DMD production

Mon Sep 12 04:17:21 PDT 2016

On Monday, 12 September 2016 at 00:46:16 UTC, Basile B. wrote:
> I have this function, written in iasm:
>
> °°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°
> T foo(T)(T x, T c)
> if (is(T == float) || is(T == double))
> {
>
>     version(none)
>     {
>         return x*x*x - x*x*c + x*c;
>     }
>     else asm
>     {
>         naked;
>         movsd   XMM3, XMM1;
>         mulsd   XMM0, XMM1;
>         mulsd   XMM1, XMM1;
>         movsd   XMM2, XMM1;
>         mulsd   XMM1, XMM3;
>         addsd   XMM1, XMM0;
>         mulsd   XMM0, XMM3;
>         subsd   XMM1, XMM0;
>         movsd   XMM0, XMM1;
>         ret;
>     }
> }
>
> [...]
>
>
> When I change the version(none) to version(all), the benchmark 
> is **7X** faster (e.g 410 against 3000 for the iasm version).
>
> This difference doesn't look normal at all.
> Does anyone know why ? The usage of the stack to move xmm1 in 
> xmm0 is particularly strange...

The function was probably implicitly wrapped in the code that's 
normally generated for a try/catch block. With

     asm nothrow
     {
     }

I get slightly better performance than the code DMD generates 
for the plain D expression.
This should probably be added to the specification, I guess; 
nothing there documents this behavior.
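
To illustrate the point about `nothrow`, here is a minimal sketch. It 
only shows that the compiler assumes a bare asm statement may throw 
unless it is annotated; it does not measure the extra code generated 
around the function. The names `plainAsm` and `nothrowAsm` are 
hypothetical, and an x86-64 target with DMD-style inline asm is 
assumed.

```d
// Minimal sketch, not the original benchmark.
// `plainAsm` and `nothrowAsm` are hypothetical names used for illustration.
// Assumes an x86-64 target with DMD-style inline assembler support.
version (D_InlineAsm_X86_64)
{
    // Templated functions get their attributes inferred from their body.
    void plainAsm()()   { asm         { nop; } }
    void nothrowAsm()() { asm nothrow { nop; } }

    // A bare asm statement is assumed to be able to throw,
    // so the enclosing function is not inferred nothrow...
    static assert(!is(typeof(&plainAsm!()) : void function() nothrow));

    // ...while an annotated asm statement lets the function be nothrow.
    static assert( is(typeof(&nothrowAsm!()) : void function() nothrow));
}

void main() {}
```

If both static asserts hold, the file compiles. Putting the bare 
`asm { }` form inside a function explicitly marked `nothrow` should be 
rejected by the compiler instead.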