A small comparison
bearophile
bearophileHUGS at lycos.com
Wed Nov 20 14:12:28 PST 2013
I've found a post about a functional library for LuaJIT code:
http://rtsisyk.github.io/luafun/intro.html
This dynamically typed Lua code:
require "fun" ()
n = 100
x = sum(map(function(x) return x^2 end, take(n,
tabulate(math.sin))))
-- calculate sum(sin(x)^2 for x in 0..n-1)
print(x)
50.011981355266
Gets compiled to:
->LOOP:
0bcaffd0 movsd [rsp+0x8], xmm7
0bcaffd6 addsd xmm4, xmm5
0bcaffda ucomisd xmm6, xmm1
0bcaffde jnb 0x0bca0028 ->6
0bcaffe4 addsd xmm6, xmm0
0bcaffe8 addsd xmm7, xmm0
0bcaffec fld qword [rsp+0x8]
0bcafff0 fsin
0bcafff2 fstp qword [rsp]
0bcafff5 movsd xmm5, [rsp]
0bcafffa mulsd xmm5, xmm5
0bcafffe jmp 0x0bcaffd0 ->LOOP
So I've written a similar D version using Phobos:
void main() {
import std.stdio, std.algorithm, std.range;
enum n = 100;
immutable double r = n.iota.map!(x => x.sin ^^ 2).reduce!q{a
+ b};
r.writeln;
}
LDC2 compiles it to (32 bit):
LBB0_1:
fxch
fstpt 44(%esp)
fld %st(0)
fstpt (%esp)
fld1
faddp %st(1)
fstpt 32(%esp)
calll __D3std4math3sinFNaNbNfeZe
subl $12, %esp
fmul %st(0), %st(0)
fldt 44(%esp)
faddp %st(1)
fstpt 44(%esp)
fldt 32(%esp)
fldt 44(%esp)
decl %esi
fxch
jne LBB0_1
As you see the D version doesn't use the fsin instruction as
LuaJIT, it calls a library function.
If I import the sin from core.stdc.math ldc2 produces (notice the
usage of xmm registers):
LBB0_1:
movsd %xmm1, 32(%esp)
movsd %xmm0, 40(%esp)
movsd %xmm1, (%esp)
calll _sin
fstpl 48(%esp)
movsd 48(%esp), %xmm0
mulsd %xmm0, %xmm0
movsd 40(%esp), %xmm1
addsd %xmm0, %xmm1
movsd %xmm1, 40(%esp)
movsd 32(%esp), %xmm1
movsd 40(%esp), %xmm0
addsd LCPI0_0, %xmm1
decl %esi
jne LBB0_1
I have also written a normal for loop in both C and D. This is C
code:
#include "stdio.h"
#include "math.h"
int main() {
double r = 0.0;
int i;
for (i = 0; i < 100; i++) {
const double aux = sin(i);
r += aux * aux;
}
printf("%f\n", r);
return 0;
}
GCC compiles it to:
.L3:
cvtsi2sd %ebx, %xmm0
call sin
.L2:
mulsd %xmm0, %xmm0
addl $1, %ebx
cmpl $100, %ebx
addsd 8(%rsp), %xmm0
movsd %xmm0, 8(%rsp)
jne .L3
The Intel compiler compiles it to this (this is even vectorized,
sin2 and mulpd work on two doubles at a time):
..B1.2: # Preds ..B1.8 ..B1.7
cvtdq2pd %xmm8, %xmm0
#8.32
call __svml_sin2
#8.28
mulpd %xmm0, %xmm0
#9.20
addb $2, %r12b
#7.5
paddd %xmm9, %xmm8
#5.14
addpd %xmm0, %xmm10
#9.9
cmpb $100, %r12b
#7.5
jb ..B1.2 # Prob 99%
#7.5
In scientific code it's common to have small kernels that take
most of the run-time of a program, so it's important to shave
every instructions from such loops.
Do you think Phobos/LDC2 are doing well enough here?
Bye,
bearophile
More information about the digitalmars-d-ldc
mailing list