Compilation of a numerical kernel
bearophile
bearophileHUGS at lycos.com
Sun Jun 27 15:26:39 PDT 2010
Recently I have seen some work from Don about floating point optimization in DMD:
http://d.puremagic.com/issues/show_bug.cgi?id=4380
http://d.puremagic.com/issues/show_bug.cgi?id=4383
so maybe he is interested in this too. This test program is the inner loop of a small program and one of its hottest spots: it determines the performance of the whole program. So even though it's just three lines of code, the compiler needs to optimize it well (the D code can be modified to unroll the loop a few times).
// D code
double foo(double[] arr1, double[] arr2) {
    double diff = 0.0;
    for (int i; i < arr1.length; i++) {
        double aux = arr1[i] - arr2[i];
        diff += aux * aux;
    }
    return diff;
}
void main() {}
D code compiled by DMD, optimized build:
L38: fld qword ptr [EDX*8][ECX]
fsub qword ptr [EDX*8][EBX]
inc EDX
cmp EDX,058h[ESP]
fstp qword ptr 014h[ESP]
fld qword ptr 014h[ESP]
fmul ST,ST(0)
fadd qword ptr 4[ESP]
fstp qword ptr 4[ESP]
jb L38
D code compiled by LDC, optimized build:
.LBB13_5:
movsd (%edi,%ecx,8), %xmm1
subsd (%eax,%ecx,8), %xmm1
incl %ecx
cmpl %esi, %ecx
mulsd %xmm1, %xmm1
addsd %xmm1, %xmm0
jne .LBB13_5
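The asm produced by DMD is not efficient, and it's not just a matter of SSE register usage: on every iteration it stores aux to the stack and reloads it, and it keeps diff in memory instead of in an FPU register.
I have translated it to C to see how GCC compiles it without SSE: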
// C code
double foo(double* arr1, double* arr2, int len) {
    double diff = 0.0;
    int i;
    for (i = 0; i < len; i++) {
        double aux = arr1[i] - arr2[i];
        diff += aux * aux;
    }
    return diff;
}
C code compiled with gcc 4.5 (32 bit):
L3:
fldl (%ecx,%eax,8)
fsubl (%ebx,%eax,8)
incl %eax
fmul %st(0), %st
cmpl %edx, %eax
faddp %st, %st(1)
jne L3
This is an example of how a compiler can compile it: the loop is unrolled once and each SSE instruction works on two doubles (this is 64-bit code too), so it is equivalent to a 4X unroll:
Modified C code compiled with GCC (64 bit):
L3:
movapd (%rcx,%rax), %xmm1
subpd (%rdx,%rax), %xmm1
movapd %xmm1, %xmm0
mulpd %xmm1, %xmm0
addpd %xmm0, %xmm2
movapd 16(%rcx,%rax), %xmm0
subpd 16(%rdx,%rax), %xmm0
addq $32, %rax
mulpd %xmm1, %xmm0
cmpq %r8, %rax
addpd %xmm0, %xmm3
jne L3
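The modified C source behind that listing is not shown here. As a rough sketch only, this is one way to write a 4X unroll by hand in plain C. The name foo_unrolled4, the size_t length and the four partial sums are my assumptions for illustration, not the code that produced the asm above. The four independent accumulators play the role of the two lanes of %xmm2 and %xmm3.

```c
#include <stddef.h>

/* Hypothetical helper (not from the original post): sum of squared
   differences, unrolled 4X by hand with four independent partial sums. */
double foo_unrolled4(const double *arr1, const double *arr2, size_t len) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    /* Main loop: four elements per iteration, no dependency between sums. */
    for (; i + 4 <= len; i += 4) {
        double a0 = arr1[i]     - arr2[i];
        double a1 = arr1[i + 1] - arr2[i + 1];
        double a2 = arr1[i + 2] - arr2[i + 2];
        double a3 = arr1[i + 3] - arr2[i + 3];
        s0 += a0 * a0;
        s1 += a1 * a1;
        s2 += a2 * a2;
        s3 += a3 * a3;
    }

    /* Combine the partial sums, then handle the 0..3 leftover elements. */
    double diff = (s0 + s1) + (s2 + s3);
    for (; i < len; i++) {
        double aux = arr1[i] - arr2[i];
        diff += aux * aux;
    }
    return diff;
}
```

Note that splitting the sum into partial sums changes the order of the floating point additions, so the result can differ in the last bits from the simple loop. As far as I know this is why GCC performs this kind of transformation on its own only when given flags such as -ffast-math.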
Cache prefetch instructions can't help much here, because the memory access pattern is a plain sequential scan of the two arrays.
Bye,
bearophile