pi benchmark on ldc and dmd

Tue Aug 2 12:49:33 PDT 2011

I think I have it: the extra 64 bit registers. I got ldc working
in 32 bit mode (I didn't have that yesterday, so I was only testing
64 bit) and compiled the benchmark with it.

No difference in timing between ldc 32 bit and dmd 32 bit.
The disassembly isn't identical but the time is. (The disassembly
seems to mainly order things differently, but ldc has fewer jump
instructions too.)

Anyway.

In 64 bit, ldc gets a speedup over dmd. Looking at the asm
output, it looks like dmd doesn't use any of the new registers
(R8 through R15), whereas ldc does. (dmd's 64 bit output looks
mostly like 32 bit code with R register names instead of E ones.)

Here's the program. It's based on one of the Python ones.

====
import std.bigint;
import std.stdio;

alias BigInt number; // swap BigInt for long to time plain integer math

void main() {
	auto N = 10000; // number of digits to generate

	number i, k, ns;
	number k1 = 1;
	number n,a,d,t,u;
	n = 1;
	d = 1;
	while(1) {
		k += 1;
		t = n<<1;
		n *= k;
		a += t;
		k1 += 2;
		a *= k1;
		d *= k1;
		if(a >= n) {
			t = (n*3 +a)/d;
			u = (n*3 +a)%d;
			u += n;
			if(d > u) {
				ns = ns*10 + t;
				i += 1;
				if(i % 10 == 0) {
					debug writefln ("%010d\t:%d", ns, i);
					ns = 0;
				}
				if(i >= N) {
					break;
				}
				a -= d*t;
				a *= 10;
				n *= 10;
			}
		}
	}
}
====
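
As a sanity check on the BigInt version, here is a minimal sketch
of the same loop wrapped in a function that returns the digits as a
string instead of printing them in groups of ten. The name piDigits
and the string return value are made up for illustration and are
not part of the benchmark; it assumes a Phobos where BigInt has
toLong.

```d
import std.bigint;
import std.stdio;

// Hypothetical helper (not part of the benchmark): same arithmetic as
// the loop above, but the digits are appended to a string.
string piDigits(size_t count) {
	BigInt k = 0, k1 = 1, n = 1, a = 0, d = 1, t, u;
	string digits;
	while (digits.length < count) {
		k += 1;
		t = n << 1;
		n *= k;
		a += t;
		k1 += 2;
		a *= k1;
		d *= k1;
		if (a >= n) {
			t = (n*3 + a) / d;
			u = (n*3 + a) % d;
			u += n;
			if (d > u) {
				// with BigInt, t is a single decimal digit at this point
				digits ~= cast(char)('0' + t.toLong());
				a -= d*t;
				a *= 10;
				n *= 10;
			}
		}
	}
	return digits;
}

void main() {
	writeln(piDigits(10)); // prints 3141592653
}
```

This gives a known-good reference to compare against before the
number type is changed.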

BigInt's calls aren't inlined, but that's a frontend issue. Let's
eliminate that by switching to long in that alias.

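The result will be wrong, since long overflows after the first few
digits, but that's beside the point for now. I just want to see
integer math. (This is why the writefln is marked debug too: it is
compiled out unless the debug switch is passed.)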
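
For reference, a rough sketch of what the long variant boils down
to once the debug output is gone. The function name
spigotIterations and its return value (the loop counter k) are
assumptions added for illustration; the arithmetic is meant to be
the same as in the program above, with the unused ns bookkeeping
dropped. The values overflow and wrap, so only the instruction mix
is meaningful, not the numbers.

```d
import std.stdio;

// Hypothetical helper: the benchmark loop with `number` fixed to long.
// Returns how many times the outer loop ran before `limit` digits
// had been counted. The arithmetic wraps on overflow.
long spigotIterations(long limit) {
	long i, k, k1 = 1, n = 1, a, d = 1, t, u;
	for (;;) {
		k += 1;
		t = n << 1;
		n *= k;
		a += t;
		k1 += 2;
		a *= k1;
		d *= k1; // a product of odd numbers, so never zero
		if (a >= n) {
			t = (n*3 + a) / d;
			u = (n*3 + a) % d;
			u += n;
			if (d > u) {
				i += 1;
				if (i >= limit)
					break;
				a -= d*t;
				a *= 10;
				n *= 10;
			}
		}
	}
	return k;
}

void main() {
	// same limit as N in the program above
	writeln(spigotIterations(10_000), " loop iterations");
}
```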

With optimizations turned on, ldc again wins by the same ratio (it
runs in about 2/3 the time), and the generated code is much easier
to read.

Let's see what's going on.

The relevant loop from DMD (64 bit):

===
L47:		inc	qword ptr -040h[RBP]
		mov	RAX,-028h[RBP]
		add	RAX,RAX
		mov	-010h[RBP],RAX
		mov	RAX,-040h[RBP]
		imul	RAX,-028h[RBP]
		mov	-028h[RBP],RAX
		mov	RAX,-010h[RBP]
		add	-020h[RBP],RAX
		add	qword ptr -030h[RBP],2
		mov	RAX,-030h[RBP]
		imul	RAX,-020h[RBP]
		mov	-020h[RBP],RAX
		mov	RAX,-030h[RBP]
		imul	RAX,-018h[RBP]
		mov	-018h[RBP],RAX
		mov	RAX,-020h[RBP]
		cmp	RAX,-028h[RBP]
		jl	L47
		mov	RAX,-028h[RBP]
		lea	RAX,[RAX*2][RAX]
		add	RAX,-020h[RBP]
		mov	-058h[RBP],RAX
		cqo
		idiv	qword ptr -018h[RBP]
		mov	-010h[RBP],RAX
		mov	RAX,-058h[RBP]
		cqo
		idiv	qword ptr -018h[RBP]
		mov	-8[RBP],RDX
		mov	RAX,-028h[RBP]
		add	-8[RBP],RAX
		mov	RAX,-018h[RBP]
		cmp	RAX,-8[RBP]
		jle	L47
		mov	RAX,-038h[RBP]
		lea	RAX,[RAX*4][RAX]
		add	RAX,RAX
		add	RAX,-010h[RBP]
		mov	-038h[RBP],RAX
		inc	qword ptr -048h[RBP]
		mov	RAX,-048h[RBP]
		mov	RCX,0Ah
		cqo
		idiv	RCX
		test	RDX,RDX
		jne	L109
		mov	qword ptr -038h[RBP],0
L109:		cmp	qword ptr -048h[RBP],02710h
		jge	L137
		mov	RAX,-018h[RBP]
		imul	RAX,-010h[RBP]
		sub	-020h[RBP],RAX
		imul	EAX,-020h[RBP],0Ah
		mov	-020h[RBP],RAX
		imul	EAX,-028h[RBP],0Ah
		mov	-028h[RBP],RAX
		jmp	  L47
===

and from ldc 64 bit:

===
L20:		add	RDI,2
		inc	RCX
		lea	R9,[R10*2][R9]
		imul	R9,RDI
		imul	R8,RDI
		imul	R10,RCX
		cmp	R9,R10
		jl	L20
		lea	RAX,[R10*2][R10]
		add	RAX,R9
		cqo
		idiv	R8
		add	RDX,R10
		cmp	R8,RDX
		jle	L20
		cmp	RSI,0270Fh
		jg	L73
		imul	RAX,R8
		sub	R9,RAX
		add	R9,R9
		lea	R9,[R9*4][R9]
		inc	RSI
		add	R10,R10
		lea	R10,[R10*4][R10]
		jmp short	L20
===

The first thing that pops out is that the ldc code is a lot
shorter. The second is that ldc makes better use of the registers:
dmd loads each variable from its stack slot (the [RBP] operands)
and stores it back, while ldc keeps them all in registers. Indeed,
the shortness looks to be thanks to the registers eliminating a
lot of movs.

So I'm pretty sure the difference is caused by dmd not using the
new registers in x64. The other differences look trivial to my
eyes.