[Issue 15255] New: Generated better code for saturation arithmetic

Wed Oct 28 20:26:38 PDT 2015

https://issues.dlang.org/show_bug.cgi?id=15255

          Issue ID: 15255
           Summary: Generated better code for saturation arithmetic
           Product: D
           Version: D2
          Hardware: All
                OS: All
            Status: NEW
          Severity: enhancement
          Priority: P1
         Component: dmd
          Assignee: nobody at puremagic.com
          Reporter: bugzilla at digitalmars.com

I've just spent a couple
of hours trying to convince the compiler to generate the code I want
to see for saturating arithmetic. DMD generates a lot of extra work.
LDC does pretty well, but I still couldn't convince it to produce the
code I wanted to see. GDC is the same as LDC, except one of my
permutations seems to have hit the sweet spot! ;)

The asm I'm chasing is:
  add %eax,%ebx
  sbb %eax,%eax
  or %ebx,%eax

I'm pretty sure this is the best result, but I can't convince DMD or
LDC to emit that code.

Code:

pure nothrow @safe @nogc:

template Promote(U)
{
    static if(is(U == ubyte)) alias Promote = ushort;
    else static if(is(U == ushort)) alias Promote = uint;
    else static if(is(U == uint)) alias Promote = ulong;
    else static if(is(U == ulong)) alias Promote = ucent;
    else static assert(false, "!");
}

// note; I cast results everywhere because D's integer promotion
seemed to interfere

U addSat(U)(U a, U b) if(isIntegral!U && isUnsigned!U)
{
    return cast(U)((a > cast(U)(U.max - b)) ? cast(U)U.max : cast(U)(a + b));
}
U addSat2(U)(U a, U b) if(isIntegral!U && isUnsigned!U)
{
    Promote!U c = cast(Promote!U)a+b;
    return cast(U)(-cast(U)(c >> (U.sizeof*8)) | cast(U)c);
    // expect:
    //   add %eax,%ebx
    //   sbb %eax,%eax
    //   or %ebx,%eax
}
U addSat3(U)(U a, U b) if(isIntegral!U && isUnsigned!U)
{
    U c = a + b;
    if (c < a) // can only happen due to overflow
        c = U(-1);
    return c;
}
U addSat4(U)(U a, U b) if(isIntegral!U && isUnsigned!U)
{
    U c = a + b;
    c |= -U(c < a);
    return c;
}

When I generate those functions for uint, I get these:

DMD's output:

pure nothrow @nogc @safe uint
std.experimental.color.internal.saturate.addSat!(uint).addSat(uint,
uint):
  0000000000000000: push        rbp
  0000000000000001: mov         rbp,rsp
  0000000000000004: mov         dword ptr [rbp+10h],ecx
  0000000000000007: mov         dword ptr [rbp+18h],edx
  000000000000000A: mov         eax,0FFFFFFFFh
  000000000000000F: sub         eax,dword ptr [rbp+10h]
  0000000000000012: cmp         eax,dword ptr [rbp+18h]
  0000000000000015: mov         eax,0FFFFFFFFh
  000000000000001A: jb          0000000000000022
  000000000000001C: mov         ecx,dword ptr [rbp+10h]
  000000000000001F: lea         eax,[rdx+rcx]
  0000000000000022: pop         rbp
  0000000000000023: ret

pure nothrow @nogc @safe uint
std.experimental.color.internal.saturate.addSat2!(uint).addSat2(uint,
uint):
  0000000000000000: push        rbp
  0000000000000001: mov         rbp,rsp
  0000000000000004: sub         rsp,10h
  0000000000000008: mov         dword ptr [rbp+10h],ecx
  000000000000000B: mov         dword ptr [rbp+18h],edx
  000000000000000E: mov         eax,dword ptr [rbp+18h]
  0000000000000011: mov         edx,dword ptr [rbp+10h]
  0000000000000014: add         rax,rdx
  0000000000000017: mov         qword ptr [rbp-8],rax
  000000000000001B: mov         eax,dword ptr [rbp-4]
  000000000000001E: neg         eax
  0000000000000020: or          eax,dword ptr [rbp-8]
  0000000000000023: lea         rsp,[rbp]
  0000000000000027: pop         rbp
  0000000000000028: ret

pure nothrow @nogc @safe uint
std.experimental.color.internal.saturate.addSat3!(uint).addSat3(uint,
uint):
  0000000000000000: push        rbp
  0000000000000001: mov         rbp,rsp
  0000000000000004: lea         eax,[rdx+rcx]
  0000000000000007: cmp         eax,edx
  0000000000000009: jae         0000000000000010
  000000000000000B: mov         eax,0FFFFFFFFh
  0000000000000010: pop         rbp
  0000000000000011: ret

pure nothrow @nogc @safe uint
std.experimental.color.internal.saturate.addSat4!(uint).addSat4(uint,
uint):
  0000000000000000: push        rbp
  0000000000000001: mov         rbp,rsp
  0000000000000004: lea         r8d,[rdx+rcx]
  0000000000000008: cmp         r8d,edx
  000000000000000B: setb        al
  000000000000000E: movzx       eax,al
  0000000000000011: neg         eax
  0000000000000013: or          r8d,eax
  0000000000000016: mov         eax,r8d
  0000000000000019: pop         rbp
  000000000000001A: ret

LDC's output:

addSat:
    leal    (%rdx,%rcx), %r8d
    notl    %ecx
    cmpl    %ecx, %edx
    movl    $-1, %eax
    cmovbel    %r8d, %eax
    retq

addSat2:
    movl    %edx, %edx
    movl    %ecx, %eax
    addq    %rdx, %rax
    movq    %rax, %rcx
    shrq    $32, %rcx
    negl    %ecx
    orl    %ecx, %eax
    retq

addSat3:
    addl    %edx, %ecx
    movl    $-1, %eax
    cmovael    %ecx, %eax
    retq

addSat4:
    addl    %edx, %ecx
    movl    $-1, %eax
    cmovael    %ecx, %eax
    retq

GDC is almost identical to LDC, except:

addSat4:
        addl    %edi, %esi
        sbbl    %eax, %eax
        orl     %esi, %eax
        ret

I think this is what I was chasing! (not sure what the deal is with
the choice of registers though?)

Anyway, I'm surprised by almost all these results, I thought all
compilers would do better on such simple code. Thought you might be
interested.

Cheers.
- Manu

--