[Issue 10636] New: Vector calling convention for D?
d-bugmail at puremagic.com
Sat Jul 13 13:41:14 PDT 2013
http://d.puremagic.com/issues/show_bug.cgi?id=10636
Summary: Vector calling convention for D?
Product: D
Version: D2
Platform: All
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P2
Component: DMD
AssignedTo: nobody at puremagic.com
ReportedBy: bearophile_hugs at eml.cc
--- Comment #0 from bearophile_hugs at eml.cc 2013-07-13 13:41:13 PDT ---
The VS2013 designers have added a new calling convention that allows passing SIMD
values to functions in registers, avoiding the stack in most cases:
http://blogs.msdn.com/b/vcblog/archive/2013/07/12/introducing-vector-calling-convention.aspx
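For reference, the feature looks roughly like this in C++ (a minimal sketch; the
function name and values are mine, but __vectorcall is the keyword the post
introduces):

#include <xmmintrin.h>

// Under the default x64 convention an __m128 argument is passed by address
// of a caller-allocated temporary; with __vectorcall it arrives directly in
// an XMM register and the result is returned in XMM0.
__m128 __vectorcall add4(__m128 a, __m128 b) {
    return _mm_add_ps(a, b);   // a arrives in XMM0, b in XMM1
}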
An example D program:
import core.stdc.stdio, core.simd;

struct Particle { float4 x, y; }

Particle addParticles(in Particle p1, in Particle p2) pure nothrow {
    return Particle(p1.x + p2.x, p1.y + p2.y);
}

void main() {
    auto p1 = Particle([1, 2, 3, 4],
                       [10, 20, 30, 40]);
    printf("%f %f %f %f %f %f %f %f\n",
           p1.x.array[0], p1.x.array[1],
           p1.x.array[2], p1.x.array[3],
           p1.y.array[0], p1.y.array[1],
           p1.y.array[2], p1.y.array[3]);

    auto p2 = Particle([100, 200, 300, 400],
                       [1000, 2000, 3000, 4000]);
    printf("%f %f %f %f %f %f %f %f\n",
           p2.x.array[0], p2.x.array[1],
           p2.x.array[2], p2.x.array[3],
           p2.y.array[0], p2.y.array[1],
           p2.y.array[2], p2.y.array[3]);

    auto p3 = addParticles(p1, p2);
    printf("%f %f %f %f %f %f %f %f\n",
           p3.x.array[0], p3.x.array[1],
           p3.x.array[2], p3.x.array[3],
           p3.y.array[0], p3.y.array[1],
           p3.y.array[2], p3.y.array[3]);
}
Compiling that code with ldc2 v0.11.0 on 32-bit Windows with:

ldc2 -O5 -disable-inlining -release -vectorize-slp -vectorize-slp-aggressive -output-s test.d

it outputs this x86 asm:
__D4test12addParticlesFNaNbxS4test8ParticlexS4test8ParticleZS4test8Particle:
    pushl %ebp
    movl %esp, %ebp
    andl $-16, %esp
    subl $16, %esp
    movaps 40(%ebp), %xmm0
    movaps 56(%ebp), %xmm1
    addps 8(%ebp), %xmm0
    addps 24(%ebp), %xmm1
    movups %xmm1, 16(%eax)
    movups %xmm0, (%eax)
    movl %ebp, %esp
    popl %ebp
    ret $64

__Dmain:
    ...
    movaps 160(%esp), %xmm0
    movaps 176(%esp), %xmm1
    movaps %xmm1, 48(%esp)
    movaps %xmm0, 32(%esp)
    movaps 128(%esp), %xmm0
    movaps 144(%esp), %xmm1
    movaps %xmm1, 16(%esp)
    movaps %xmm0, (%esp)
    leal 96(%esp), %eax
    calll __D4test12addParticlesFNaNbxS4test8ParticlexS4test8ParticleZS4test8Particle
    subl $64, %esp
    movss 96(%esp), %xmm0
    movss 100(%esp), %xmm1
    movss 104(%esp), %xmm2
    movss 108(%esp), %xmm3
    movss 112(%esp), %xmm4
    movss 116(%esp), %xmm5
    movss 120(%esp), %xmm6
    movss 124(%esp), %xmm7
    cvtss2sd %xmm7, %xmm7
    movsd %xmm7, 60(%esp)
    cvtss2sd %xmm6, %xmm6
    movsd %xmm6, 52(%esp)
    cvtss2sd %xmm5, %xmm5
    movsd %xmm5, 44(%esp)
    cvtss2sd %xmm4, %xmm4
    movsd %xmm4, 36(%esp)
    cvtss2sd %xmm3, %xmm3
    movsd %xmm3, 28(%esp)
    cvtss2sd %xmm2, %xmm2
    movsd %xmm2, 20(%esp)
    cvtss2sd %xmm1, %xmm1
    movsd %xmm1, 12(%esp)
    cvtss2sd %xmm0, %xmm0
    movsd %xmm0, 4(%esp)
    movl $_.str3, (%esp)
    calll ___mingw_printf
    xorl %eax, %eax
    movl %ebp, %esp
    popl %ebp
    ret
As shown in that article, with a vector calling convention, setting the arguments
of addParticles needs only four movaps (instead of eight plus the leal). With the
vector calling convention the body of addParticles also gets shorter, because all
the needed operands are already in xmm registers. Probably the code of
addParticles becomes just two addps, a ret, and maybe two movaps to put the
result in the right output registers.
D is meant to be useful for people who write fast video games or other numerical
code, and both use plenty of SIMD code, so I think adding such an optimization
could be useful. But I can't estimate how much of an advantage it would give;
benchmarks are needed. They write:
>Today on AMD64 target, passed by value vector arguments (such as __m128/__m256/) must be turned into a passed by address of a temporary buffer (i.e. $T1, $T2, $T3 in the figure above) allocated in caller's local stack as shown in the figure above. We have been receiving increasing concerns about this inefficiency in past years, especially from game, graphic, video/audio, and codec domains. A concrete example is MS XNA library in which passing vector arguments is a common pattern in many APIs of XNAMath library. The inefficiency will be intensified on upcoming AVX2/AVX3 and future processors with wider vector registers.<
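To make the quoted inefficiency concrete, here is a hedged C++ sketch contrasting
the two conventions (combine is an invented function):

#include <xmmintrin.h>

// Default x64 convention: each __m128 argument is spilled to a
// caller-allocated 16-byte temporary and passed by address, so the vector
// data goes through memory (compare the movaps stores in the __Dmain
// listing above).
__m128 combine(__m128 a, __m128 b, __m128 c, __m128 d);

// __vectorcall: the first six vector arguments arrive directly in
// XMM0..XMM5 and the result is returned in XMM0, so no temporaries are needed.
__m128 __vectorcall combine_vc(__m128 a, __m128 b, __m128 c, __m128 d);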
On the other hand, small functions get inlined, and introducing a new calling
convention has a disadvantage, as noted in a comment by Iain Buclaw:
> I'd vote for not adding more fluff which makes ABI differences
> between compilers greater. But it certainly looks like if would
> be useful if you wish to save the time taken to copy the vector
> from XMM registers onto the stack and back again when passing
> values around.
Maybe such a vector calling convention will become more standard in the future,
as it seems a useful improvement.