poor codegen for abs(), and div by literal?
Basile B.
b2.temp at gmx.com
Sat Feb 9 03:28:41 UTC 2019
On Thursday, 7 February 2019 at 23:15:08 UTC, NaN wrote:
> Given the following code...
>
>
> module ohreally;
>
> import std.math;
>
> float foo(float y, float x)
> {
> float ax = fabs(x);
> float ay = fabs(y);
> return ax*ay/3.142f;
> }
>
>
> the compiler outputs the following for the function body,
> (excluding prolog and epilogue code)...
>
> movss -010h[RBP],XMM0
> movss -8[RBP],XMM1
> fld float ptr -010h[RBP]
> fabs
> fstp qword ptr -020h[RBP]
> movsd XMM0,-020h[RBP]
> cvtsd2ss XMM0,XMM0
> fld float ptr -8[RBP]
> fabs
> fstp qword ptr -020h[RBP]
> movsd XMM1,-020h[RBP]
>
> cvtsd2ss XMM2,XMM1
> mulss XMM0,XMM2
> movss XMM3,FLAT:.rodata[00h][RIP]
> divss XMM0,XMM3
>
> So to do the abs(), it stores to memory from XMM reg, loads
> into x87 FPU regs, does the abs with the old FPU instruction,
> then for some reason stores the result as a double, loads that
> back into an XMM, converts it back to single.
>
> And the div by 3.142f, is there a reason it cant be converted
> to a multiply? I know I can coax the multiply by doing
> *(1.0f/3.142f) instead, but I wondered if there's some
> reasoning in why its not done automatically?
>
> Is any of this worth add to the bug tracker?
Essentially the problem is that fabs() is always done on the x87
FPU, so with DMD the values always make a round trip between the
SSE registers, the stack temporaries and the FPU registers. But
fabs() doesn't need extended precision. Unlike the trigonometric
operations, it only has to clear a **single bit**...
On amd64, fabs (for single and double) could be done using SSE
**only**. I don't know how compiler intrinsics work, but the SSE
version is not a single instruction. It takes either 3
instructions (generate a mask + logical AND) or 2 (left shift by
1, then right shift by 1 to clear the sign).
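To illustrate the "single bit" point in plain D, here is a minimal
sketch that clears the sign bit with an integer mask. The name
absViaMask is hypothetical (it is not a Phobos or druntime
function), and the sketch assumes float is a 32-bit IEEE 754
value, which is the case on amd64:

```d
// Hypothetical helper, not part of Phobos: absolute value of a
// float obtained by clearing bit 31 (the IEEE 754 sign bit).
float absViaMask(float x) pure nothrow @nogc
{
    // Reinterpret the float's storage as a 32-bit integer.
    union Bits { float f; uint u; }
    Bits b;
    b.f = x;
    b.u &= 0x7FFF_FFFF; // same mask that pcmpeqd + psrld 1 builds
    return b.f;
}

unittest
{
    assert(absViaMask(-1.5f) == 1.5f);
    assert(absViaMask(2.0f) == 2.0f);
}
```

Whether a given compiler turns this into a single andps is not
guaranteed. The point is only that no x87 instruction is needed.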
An SSE-only version would look more like:
pcmpeqd xmm2, xmm2
psrld xmm2, 01h
andps xmm0, xmm2
andps xmm1, xmm2
mulss xmm0, xmm1
mulss xmm0, dword ptr [<address of constant>]
ret
In inline asm (note: sadly the constant cannot be a static
immutable, so it is passed as a defaulted parameter):
extern(C) float foo2(float y, float x, const float z = 1.0f / 3.142f)
{
    asm pure nothrow
    {
        naked;
        pcmpeqd XMM3, XMM3;
        psrld XMM3, 1;
        andps XMM0, XMM3;
        andps XMM1, XMM3;
        mulss XMM0, XMM1;
        mulss XMM0, XMM2;
        ret;
    }
}
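For comparison, a rough sketch of the same computation without
inline asm, using the reciprocal multiply mentioned in the quoted
message. The names fooMul, fooDiv and invK are made up for this
example. In regular D code the constant can be an enum, so no
extra parameter is needed. The two versions may differ in the last
bits, since 1.0f / 3.142f is itself rounded:

```d
import std.math : fabs;

// Hypothetical variant: multiply by a compile-time reciprocal.
float fooMul(float y, float x)
{
    enum float invK = 1.0f / 3.142f; // folded at compile time
    return fabs(x) * fabs(y) * invK;
}

// The original formulation, with the division kept as written.
float fooDiv(float y, float x)
{
    return fabs(x) * fabs(y) / 3.142f;
}

unittest
{
    // Results agree only up to rounding, hence the tolerance.
    assert(fabs(fooMul(-2.0f, 3.0f) - fooDiv(-2.0f, 3.0f)) < 1e-6f);
}
```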
LDC2 does almost that, except that the logical AND is in a
subroutine:
push rax
movss dword ptr [rsp+04h], xmm1
call 000000000045A020h
movss dword ptr [rsp], xmm0
movss xmm0, dword ptr [rsp+04h]
call 000000000045A020h
mulss xmm0, dword ptr [rsp]
mulss xmm0, dword ptr [<address of constant>]
pop rax
ret
000000000049DA20h:
andps xmm0, dqword ptr [<address of mask>]
ret
Coming back to the bug of values making round trips: it is known,
and it can even happen when the FPU is not used [1].
[1] https://issues.dlang.org/show_bug.cgi?id=17965