Programming language benchmarks

Thu Apr 28 04:50:19 PDT 2011

Moritz Warning:

> I think this might be interesting; it's also using D.

I was about to post this :-)
The little benchmark:
http://attractivechaos.wordpress.com/2011/04/25/my-programming-language-benchmarks-plb/
http://attractivechaos.github.com/plb/

(The matrix mul code is not a realistic example of code because people use library code to do this operation.)

Original C code:

double **mm_mul(int n, double *const *a, double *const *b)
{
    int i, j, k;
    double **m, **c;
    m = mm_init(n); c = mm_init(n);
    for (i = 0; i < n; ++i) // transpose
        for (j = 0; j < n; ++j)
            c[i][j] = b[j][i];
    for (i = 0; i < n; ++i) {
        double *p = a[i], *q = m[i];
        for (j = 0; j < n; ++j) {
            double t = 0.0, *r = c[j];
            for (k = 0; k < n; ++k)
                t += p[k] * r[k];
            q[j] = t;
        }
    }
    mm_destroy(n, c);
    return m;
}

The original D1 code of the matrix mul, coming from the Java version:

double[][] matmul(double[][] a, double[][] b) {
    int m = a.length, n = a[0].length, p = b[0].length;
    double[][] x, c;
    x.length = m; c.length = p;
    for (int i = 0; i < m; ++i) x[i].length = p;
    for (int i = 0; i < p; ++i) c[i].length = n;
    for (int i = 0; i < n; ++i) // transpose
        for (int j = 0; j < p; ++j)
            c[j][i] = b[i][j];
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < p; ++j) {
            double s = 0.0;
            for (int k = 0; k < n; ++k)
                s += a[i][k] * c[j][k];
            x[i][j] = s;
        }
    return x;
}

Improved D2 code:
http://codepad.org/aifQYhjI

pure nothrow double[][] matMul(in double[][] a, in double[][] b) {
    immutable int m = a.length,
                  n = a[0].length,
                  p = b[0].length;

    // transpose
    auto c = new double[][](p, n);
    foreach (i, brow; b)
        foreach (j, bx; brow)
            c[j][i] = bx;

    auto x = new double[][](m, p);

    foreach (i, arow; a)
        foreach (j, crow; c) {
            // x[i][j] = std.numeric.dotProduct(arow, crow); // right way
            double s = 0.0;
            foreach (k, arowk; arow)
                s += arowk * crow[k];
            x[i][j] = s;
        }

    return x;
}

Lua code:

function matrix.mul(a, b, n)
	local y = matrix.new(n, n)
	local c = matrix.T(b, n)
	for i = 1, n - 1 do
		for j = 1, n - 1 do
			local sum = 0
			for k = 0, n - 1 do sum = sum + a[i][k] * c[j][k] end
			y[i][j] = sum
		end
	end
	return y
end

--------------------

Timings, seconds:
  ICC-12.0.3               C    1.8
  GCC-4.3.2                C    2.3
  GDC-0.24                 D    8.8 (DMD gives similar timings)
  Java at JRE-1.6.0_14        Java 2.9
  LuaJIT-2.0.0-beta6 (JIT) Lua  2.7

Improved D version: about 4.4 seconds

Inner loop GCC:

L25:
	fldl	(%edx,%eax,8)
	fmull	(%ecx,%eax,8)
	addl	$1, %eax
	cmpl	%ebx, %eax
	faddp	%st, %st(1)
	jne	L25

Inner loop DMD, original version:

L153:	mov	EDX,4[EBX]
		mov	EAX,[EBX]
		mov	EAX,024h[ESP]
		fld	qword ptr [ESI*8][EDX]
		mov	EDX,028h[ESP]
		mov	EAX,[EDI*8][EDX]
		mov	EDX,4[EDI*8][EDX]
		fmul	qword ptr [ESI*8][EDX]
		inc	ESI
		cmp	ESI,EBP
		fadd	qword ptr 034h[ESP]
		fstp	qword ptr 034h[ESP]
		jl	L153

Inner loop DMD, improved:

L172:	fld	qword ptr [EBX*8][ECX]
		fmul	qword ptr [EBX*8][ESI]
		inc	EBX
		cmp	EBX,04Ch[ESP]
		fadd	qword ptr 054h[ESP]
		fstp	qword ptr 054h[ESP]
		jb	L172

The asm produced by DMD perfoms too many memory accesses in the inner loop. Is it possible to improve the DMD backend to remove this problem?

Bye,
bearophile