poor codegen for abs(), and div by literal?

Sat Feb 9 03:28:41 UTC 2019

On Thursday, 7 February 2019 at 23:15:08 UTC, NaN wrote:
> Given the following code...
>
>
> module ohreally;
>
> import std.math;
>
> float foo(float y, float x)
> {
>     float ax = fabs(x);
>     float ay = fabs(y);
>     return ax*ay/3.142f;
> }
>
>
> the compiler outputs the following for the function body, 
> (excluding prolog and epilogue code)...
>
>    movss   -010h[RBP],XMM0
>    movss   -8[RBP],XMM1
>    fld     float ptr -010h[RBP]
>    fabs
>    fstp    qword ptr -020h[RBP]
>    movsd   XMM0,-020h[RBP]
>    cvtsd2ss        XMM0,XMM0
>    fld     float ptr -8[RBP]
>    fabs
>    fstp    qword ptr -020h[RBP]
>    movsd   XMM1,-020h[RBP]
>
>    cvtsd2ss        XMM2,XMM1
>    mulss   XMM0,XMM2
>    movss   XMM3,FLAT:.rodata[00h][RIP]
>    divss   XMM0,XMM3
>
> So to do the abs(), it stores to memory from XMM reg, loads 
> into x87 FPU regs, does the abs with the old FPU instruction, 
> then for some reason stores the result as a double, loads that 
> back into an XMM, converts it back to single.
>
> And the div by 3.142f, is there a reason it can't be converted 
> to a multiply? I know I can coax the multiply by doing 
> *(1.0f/3.142f) instead, but I wondered if there's some 
> reasoning in why it's not done automatically?
>
> Is any of this worth adding to the bug tracker?

Essentially the problem is that DMD always implements fabs() with 
the x87 FPU, so the values keep making round trips between the SSE 
registers, temporaries on the stack, and the FPU registers. But 
fabs() doesn't need extended precision. Unlike the trigonometric 
functions, it only has to clear a **single bit**, the sign bit.
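
To make the "single bit" point concrete, here is a minimal sketch in 
plain D. The helper name absBits is hypothetical (it is not part of 
Phobos), and it assumes the usual IEEE 754 single layout where bit 31 
is the sign:

```d
import std.math : signbit;

// Hypothetical helper: absolute value of a float computed by
// clearing bit 31 (the IEEE 754 sign bit), no FPU semantics needed.
float absBits(float x) pure nothrow @nogc @safe
{
    union Bits { float f; uint u; }
    Bits b;
    b.f = x;              // store the float
    b.u &= 0x7FFF_FFFF;   // clear the sign bit, keep exponent and mantissa
    return b.f;           // read the same 32 bits back as a float
}

void main()
{
    assert(absBits(-3.5f) == 3.5f);
    assert(absBits(2.0f) == 2.0f);
    assert(!signbit(absBits(-0.0f)));   // -0.0 becomes +0.0
}
```

Nothing here depends on precision or rounding, which is why going 
through the x87 unit buys nothing for this operation.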

On amd64, fabs (for single and double) could be done using SSE 
**only**. I don't know how the compiler intrinsics work, but the SSE 
version is not a single instruction. It takes either 3 instructions 
(2 to generate a mask + a logical AND) or 2 (shift left by 1, then 
logical shift right by 1, to clear the sign bit).
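
The two variants can be checked against each other on the raw bit 
patterns. This is only a scalar sketch of what the SSE instructions 
do per 32-bit lane; bitsOf is a hypothetical helper, not a library 
function:

```d
import std.math : fabs;

// Hypothetical helper: the raw 32-bit pattern of a float.
uint bitsOf(float x) pure nothrow @nogc @safe
{
    union U { float f; uint u; }
    U t;
    t.f = x;
    return t.u;
}

void main()
{
    // Variant 1: all ones shifted right by one gives the mask
    // (what pcmpeqd + psrld produce in each lane).
    enum uint mask = uint.max >>> 1;
    static assert(mask == 0x7FFF_FFFF);

    foreach (float v; [-3.5f, 2.0f, -0.0f, -float.infinity])
    {
        const uint b = bitsOf(v);
        const uint viaMask = b & mask;            // mask + AND
        const uint viaShifts = (b << 1) >>> 1;    // shift out the sign, shift back
        assert(viaMask == viaShifts);
        assert(viaMask == bitsOf(fabs(v)));       // both agree with std.math.fabs
    }
}
```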

An SSE-only version would look more like this:

   pcmpeqd xmm2, xmm2
   psrld xmm2, 01h
   andps xmm0, xmm2
   andps xmm1, xmm2
   mulss xmm0, xmm1
   mulss xmm0, dword ptr [<address of constant>]
   ret

In inline asm (note: sadly, the constant cannot be a static 
immutable, so it is passed as a defaulted parameter):

   extern(C) float foo2(float y, float x, const float z = 1.0f / 3.142f)
   {
     asm pure nothrow
     {
       naked;
       pcmpeqd XMM3, XMM3;
       psrld   XMM3, 1;
       andps   XMM0, XMM3;
       andps   XMM1, XMM3;
       mulss   XMM0, XMM1;
       mulss   XMM0, XMM2;
       ret;
     }
   }
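
About the division: both snippets above multiply by 1.0f / 3.142f, 
which is what the original post had to write by hand. As far as I 
know, compilers do not make that swap on their own (outside of 
fast-math style options) because the reciprocal is itself rounded, so 
the two forms are not guaranteed to give bit-identical results. A 
small sketch to compare them, with fooDiv and fooMul as hypothetical 
names:

```d
import std.math : fabs;
import std.stdio : writefln;

// Hypothetical reference versions: same math as foo, written both ways.
float fooDiv(float y, float x) { return fabs(x) * fabs(y) / 3.142f; }
float fooMul(float y, float x) { return fabs(x) * fabs(y) * (1.0f / 3.142f); }

void main()
{
    size_t mismatches;
    foreach (i; 1 .. 1001)
    {
        const float v = i * 0.37f;
        const a = fooDiv(v, -v);
        const b = fooMul(v, -v);
        if (a != b)
            ++mismatches;
        // The two results may differ, but only by a few ulps.
        assert(fabs(a - b) <= 4 * float.epsilon * fabs(a));
    }
    writefln("%s of 1000 inputs give different results", mismatches);
}
```

The count depends on the compiler and on how intermediates are kept, 
so treat it as an experiment rather than a fixed number.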

LDC2 does almost that, except that the logical AND lives in a 
separate subroutine:

   push rax
   movss dword ptr [rsp+04h], xmm1
   call 000000000045A020h
   movss dword ptr [rsp], xmm0
   movss xmm0, dword ptr [rsp+04h]
   call 000000000045A020h
   mulss xmm0, dword ptr [rsp]
   mulss xmm0, dword ptr [<address of constant>]
   pop rax
   ret

000000000049DA20h:
   andps xmm0, dqword ptr [<address of mask>]
   ret

To come back to the bug of the "tripping values": it is a known 
issue, and it can even happen when the FPU is not used [1].

[1] https://issues.dlang.org/show_bug.cgi?id=17965