Compilation of a numerical kernel

Sun Jun 27 15:26:39 PDT 2010

Recently I have seen some work from Don about floating point optimization in DMD:
http://d.puremagic.com/issues/show_bug.cgi?id=4380
http://d.puremagic.com/issues/show_bug.cgi?id=4383

so maybe he is interested in this too. This test program is the inner loop of a small program and one of its hottest spots: it determines the performance of the whole program. So even though it's just three lines of code, the compiler needs to optimize it well (the D code can also be modified to unroll the loop a few times).

// D code
double foo(double[] arr1, double[] arr2) {
    double diff = 0.0;
    for (int i; i < arr1.length; i++) {
        double aux = arr1[i] - arr2[i];
        diff += aux * aux;
    }
    return diff;
}
void main() {}

D code compiled by DMD, optimized build:
L38:    fld qword ptr [EDX*8][ECX]
        fsub    qword ptr [EDX*8][EBX]
        inc EDX
        cmp EDX,058h[ESP]
        fstp    qword ptr 014h[ESP]
        fld qword ptr 014h[ESP]
        fmul    ST,ST(0)
        fadd    qword ptr 4[ESP]
        fstp    qword ptr 4[ESP]
        jb  L38

D code compiled by LDC, optimized build:
.LBB13_5:
    movsd   (%edi,%ecx,8), %xmm1
    subsd   (%eax,%ecx,8), %xmm1
    incl    %ecx
    cmpl    %esi, %ecx
    mulsd   %xmm1, %xmm1
    addsd   %xmm1, %xmm0
    jne .LBB13_5

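The asm produced by DMD is not efficient, and it's not a matter of SSE register usage: on each iteration it stores the difference to the stack and reloads it (fstp/fld), and it keeps the accumulator in memory instead of in an FPU register.
I have translated the code to C to see how GCC compiles it without SSE: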

// C code
double foo(double* arr1, double* arr2, int len) {
    double diff = 0.0;
    int i;
    for (i = 0; i < len; i++) {
        double aux = arr1[i] - arr2[i];
        diff += aux * aux;
    }

    return diff;
}

C code compiled with gcc 4.5 (32 bit):
L3:
    fldl    (%ecx,%eax,8)
    fsubl   (%ebx,%eax,8)
    incl    %eax
    fmul    %st(0), %st
    cmpl    %edx, %eax
    faddp   %st, %st(1)
    jne L3
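
The manual unrolling mentioned for the D code could look roughly like this. It is only a sketch, written in C so it can be compared with the function above; the name foo_unrolled2 and the use of size_t for the length are my assumptions, not part of the original program. The additions happen in the same order as in the simple loop, so the result should be identical:

```c
#include <stddef.h>

/* Hypothetical name: same kernel as foo(), unrolled by two by hand. */
double foo_unrolled2(const double *arr1, const double *arr2, size_t len) {
    double diff = 0.0;
    size_t i = 0;

    /* Main loop: two elements per iteration, accumulated in the
       original order, so no floating point reassociation happens. */
    for (; i + 1 < len; i += 2) {
        double aux0 = arr1[i] - arr2[i];
        double aux1 = arr1[i + 1] - arr2[i + 1];
        diff += aux0 * aux0;
        diff += aux1 * aux1;
    }

    /* Remainder: at most one element is left when len is odd. */
    if (i < len) {
        double aux = arr1[i] - arr2[i];
        diff += aux * aux;
    }
    return diff;
}
```

This only saves loop overhead (one increment, compare and branch per two elements); every addition still depends on the previous one.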

Here is an example of how a compiler can compile this kernel: the loop is unrolled once and each SSE instruction works on two doubles (this is a 64 bit build too), so it is equivalent to a 4X unroll:

Modified C code compiled with GCC (64 bit):
L3:
    movapd  (%rcx,%rax), %xmm1
    subpd   (%rdx,%rax), %xmm1
    movapd  %xmm1, %xmm0
    mulpd   %xmm1, %xmm0
    addpd   %xmm0, %xmm2
    movapd  16(%rcx,%rax), %xmm0
    subpd   16(%rdx,%rax), %xmm0
    addq    $32, %rax
    mulpd   %xmm1, %xmm0
    cmpq    %r8, %rax
    addpd   %xmm0, %xmm3
    jne L3
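
The packed loop above keeps two accumulator registers (xmm2 and xmm3) of two doubles each, so four partial sums in total. In scalar C the same structure might be sketched as below. This is not the modified C source that produced the listing (that source is not shown here); the name foo_acc4 is hypothetical. Unlike the simple loop, this version changes the order of the floating point additions, so the result can differ in the last bits, and as far as I know that is why GCC generally does this kind of transformation on its own only when reassociation is allowed (for example with -ffast-math):

```c
#include <stddef.h>

/* Hypothetical sketch: four independent partial sums, one for each
   double lane of the two packed accumulators in the listing above. */
double foo_acc4(const double *arr1, const double *arr2, size_t len) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    /* Main loop: four elements per iteration, each added to its own
       accumulator, so the four additions do not depend on each other. */
    for (; i + 3 < len; i += 4) {
        double d0 = arr1[i]     - arr2[i];
        double d1 = arr1[i + 1] - arr2[i + 1];
        double d2 = arr1[i + 2] - arr2[i + 2];
        double d3 = arr1[i + 3] - arr2[i + 3];
        s0 += d0 * d0;
        s1 += d1 * d1;
        s2 += d2 * d2;
        s3 += d3 * d3;
    }

    /* Combine the partial sums once, after the loop. */
    double diff = (s0 + s1) + (s2 + s3);

    /* Remainder: up to three elements when len is not a multiple of 4. */
    for (; i < len; i++) {
        double d = arr1[i] - arr2[i];
        diff += d * d;
    }
    return diff;
}
```

Note that the packed listing also uses movapd, which requires 16-byte aligned addresses; the scalar sketch has no such requirement.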

Cache prefetch instructions can't help much here, because the memory access pattern is a plain sequential scan of the two arrays.

Bye,
bearophile