pi benchmark on ldc and dmd
Adam D. Ruppe
destructionator at gmail.com
Tue Aug 2 12:49:33 PDT 2011
I think I have it: 64 bit registers. I got ldc to work
in 32 bit (I didn't have that yesterday, so I was doing 64 bit only)
and compiled the program with it.
No difference in timing between ldc 32 bit and dmd 32 bit.
The disassembly isn't identical but the time is. (The disassembly
seems to mainly order things differently, but ldc has fewer jump
instructions too.)
Anyway.
In 64 bit, ldc gets a speedup over dmd. Looking at the asm
output, it looks like dmd doesn't use any of the new registers
(R8 through R15), whereas ldc does. (dmd's 64 bit output looks mostly
like 32 bit code with r instead of e in the register names.)
Here's the program. It's based on one of the Python ones.
====
import std.bigint;
import std.stdio;

alias BigInt number;

void main() {
    auto N = 10000;
    number i, k, ns;
    number k1 = 1;
    number n,a,d,t,u;
    n = 1;
    d = 1;
    while(1) {
        k += 1;
        t = n<<1;
        n *= k;
        a += t;
        k1 += 2;
        a *= k1;
        d *= k1;
        if(a >= n) {
            t = (n*3 +a)/d;
            u = (n*3 +a)%d;
            u += n;
            if(d > u) {
                ns = ns*10 + t;
                i += 1;
                if(i % 10 == 0) {
                    debug writefln ("%010d\t:%d", ns, i);
                    ns = 0;
                }
                if(i >= N) {
                    break;
                }
                a -= d*t;
                a *= 10;
                n *= 10;
            }
        }
    }
}
=====
BigInt's calls aren't inlined, but that's a frontend issue. Let's
eliminate that by switching the alias from BigInt to long.
The result will be wrong, since long overflows, but that's beside the
point for now. I just want to see integer math. (This is why the
writefln is debug, too.)
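For anyone who wants to flip between the two number types without
editing the alias each time, here is a rough sketch of the same loop
with the type as a template parameter. The name piGroups and the
returned array of 10-digit groups are my own additions for
illustration, not part of the program above; the arithmetic is
unchanged.

```d
import std.bigint;
import std.stdio;

// Hypothetical helper (not in the original program): same loop, but the
// number type T is a template parameter and each finished group of ten
// digits is collected instead of printed.
T[] piGroups(T)(int N)
{
    T[] groups;
    T i = 0, k = 0, ns = 0, k1 = 1;
    T n = 1, a = 0, d = 1, t = 0, u = 0;
    while (true)
    {
        k += 1;
        t = n << 1;
        n *= k;
        a += t;
        k1 += 2;
        a *= k1;
        d *= k1;
        if (a >= n)
        {
            t = (n * 3 + a) / d;
            u = (n * 3 + a) % d;
            u += n;
            if (d > u)
            {
                ns = ns * 10 + t;   // append the next digit to the group
                i += 1;
                if (i % 10 == 0)
                {
                    groups ~= ns;   // group of ten digits is complete
                    ns = 0;
                }
                if (i >= N)
                    break;
                a -= d * t;
                a *= 10;
                n *= 10;
            }
        }
    }
    return groups;
}

void main()
{
    // BigInt instantiation: arbitrary precision, so the digits are right.
    auto groups = piGroups!BigInt(20);
    assert(groups[0] == BigInt("3141592653")); // first ten digits of pi
    writeln(groups);
}
```

Instantiating it as piGroups!long(10000) should correspond to the
long variant measured below: the digits come out wrong once the values
overflow, but the loop is plain 64 bit integer math with no BigInt
calls in the way.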
With optimizations turned on, ldc again wins by the same ratio -
it runs in about 2/3 the time - and the code is much easier to look
at.
Let's see what's going on.
The relevant loop from DMD (64 bit):
===
L47:    inc qword ptr -040h[RBP]
        mov RAX,-028h[RBP]
        add RAX,RAX
        mov -010h[RBP],RAX
        mov RAX,-040h[RBP]
        imul RAX,-028h[RBP]
        mov -028h[RBP],RAX
        mov RAX,-010h[RBP]
        add -020h[RBP],RAX
        add qword ptr -030h[RBP],2
        mov RAX,-030h[RBP]
        imul RAX,-020h[RBP]
        mov -020h[RBP],RAX
        mov RAX,-030h[RBP]
        imul RAX,-018h[RBP]
        mov -018h[RBP],RAX
        mov RAX,-020h[RBP]
        cmp RAX,-028h[RBP]
        jl L47
        mov RAX,-028h[RBP]
        lea RAX,[RAX*2][RAX]
        add RAX,-020h[RBP]
        mov -058h[RBP],RAX
        cqo
        idiv qword ptr -018h[RBP]
        mov -010h[RBP],RAX
        mov RAX,-058h[RBP]
        cqo
        idiv qword ptr -018h[RBP]
        mov -8[RBP],RDX
        mov RAX,-028h[RBP]
        add -8[RBP],RAX
        mov RAX,-018h[RBP]
        cmp RAX,-8[RBP]
        jle L47
        mov RAX,-038h[RBP]
        lea RAX,[RAX*4][RAX]
        add RAX,RAX
        add RAX,-010h[RBP]
        mov -038h[RBP],RAX
        inc qword ptr -048h[RBP]
        mov RAX,-048h[RBP]
        mov RCX,0Ah
        cqo
        idiv RCX
        test RDX,RDX
        jne L109
        mov qword ptr -038h[RBP],0
L109:   cmp qword ptr -048h[RBP],02710h
        jge L137
        mov RAX,-018h[RBP]
        imul RAX,-010h[RBP]
        sub -020h[RBP],RAX
        imul EAX,-020h[RBP],0Ah
        mov -020h[RBP],RAX
        imul EAX,-028h[RBP],0Ah
        mov -028h[RBP],RAX
        jmp L47
===
and from ldc 64 bit:
====
L20:    add RDI,2
        inc RCX
        lea R9,[R10*2][R9]
        imul R9,RDI
        imul R8,RDI
        imul R10,RCX
        cmp R9,R10
        jl L20
        lea RAX,[R10*2][R10]
        add RAX,R9
        cqo
        idiv R8
        add RDX,R10
        cmp R8,RDX
        jle L20
        cmp RSI,0270Fh
        jg L73
        imul RAX,R8
        sub R9,RAX
        add R9,R9
        lea R9,[R9*4][R9]
        inc RSI
        add R10,R10
        lea R10,[R10*4][R10]
        jmp short L20
===
First thing that immediately pops out is the code is a lot shorter.
Second thing that jumps out is it looks like ldc makes better use
of the registers. Indeed, the shortness looks to be thanks to
the registers eliminating a lot of movs.
So I'm pretty sure the difference is caused by dmd not using the
new registers in x64. The other differences look trivial to my
eyes.