How to tune numerical D? (matrix multiplication is faster in g++ vs gdc)
John Colvin
john.loughran.colvin at gmail.com
Sun Mar 3 20:02:28 PST 2013
On Monday, 4 March 2013 at 03:48:45 UTC, J wrote:
> Dear D pros,
>
> As a fan of D, I was hoping to be able to get similar results
> as this fellow on stack overflow, by noting his tuning steps;
> http://stackoverflow.com/questions/5142366/how-fast-is-d-compared-to-c
>
> Sadly however, when I pull out a simple matrix multiplication
> benchmark from the old language shootout (back when it had D),
> it is disturbingly slower in D when pitted against C++.
>
> Details? I ran with very recent gdc (gcc 4.7.2, gdc on the
> 4.7.2 branch, pullreq #51, commit
> b8f5c22b0e7afa7e68a287ed788597e783540063), and the exact same
> gcc c++ compiler.
>
> How would I tune this to be more competitive? I'm comparing
> gdc vs g++ both built using the exact same gcc-4.7.2 back end,
> so it has to be something in the front end. I've disabled GC
> after the matrices are made in D, so that doesn't explain it.
>
> What is going on? I'm hoping I'm making a silly, naive,
> obvious beginner mistake, but could that be? I'm not sure how
> to apply the 'in' argument advice given on stackoverflow; if
> that is the answer, could someone summarise the best practice
> for 'in' use?
>
> Thank you!
>
> - J
>
> $ g++ --version #shows: g++ (GCC) 4.7.2
> $ uname -a
> Linux gofast 2.6.35-24-generic #42-Ubuntu SMP Thu Dec 2
> 02:41:37 UTC 2010 x86_64 GNU/Linux
>
> # first, g++, two runs:
>
> $ g++ -O3 matrix.cpp -ocppmatrix
> $ time ./cppmatrix
> -1015380632 859379360 -367726792 -1548829944
>
> real 1m31.941s
> user 1m31.920s
> sys 0m0.010s
> $ time ./cppmatrix
> -1015380632 859379360 -367726792 -1548829944
>
> real 1m32.068s
> user 1m32.010s
> sys 0m0.050s
>
>
> # second, gdc, two runs:
>
> $ gdmd -O -inline -release -noboundscheck -m64 matrix.d
> -ofdmatrix
> $ time ./dmatrix
> -1015380632 859379360 -367726792 -1548829944
>
> real 2m10.677s
> user 2m10.650s
> sys 0m0.020s
> $
> $ time ./dmatrix
> -1015380632 859379360 -367726792 -1548829944
>
> real 2m12.664s
> user 2m12.600s
> sys 0m0.030s
>
> # SIZE = 2000 results:
>
> # It appears D (gdc) is 30% slower than C++ (g++), using the
> exact same backend compiler.
>
> # it doesn't even appear to help to request O3 directly: it
> goes slower--
>
> $ gdmd -O -q,-O3 -inline -release -noboundscheck -m64 matrix.d
> -ofdmatrix
> $ time ./dmatrix
> -1015380632 859379360 -367726792 -1548829944
>
> real 2m17.107s
> user 2m17.080s
> sys 0m0.020s
> jaten at afarm:~/tmp$
>
>
> # Though still beating java, but not by much. (Java code not
> shown; it's the same source as all of these: the historical
> http://shootout.alioth.debian.org/ code from when D was in the
> shootout.)
>
> $ time java matrix
> -1015380632 859379360 -367726792 -1548829944
>
> real 2m23.739s
> user 2m23.650s
> sys 0m0.130s
> $
>
>
> Slightly bigger matrix?
>
> SIZE = 2500 results: 25% slower in D
>
> $ time ./cpp.O3.matrix
> -1506465222 -119774408 -1600478274 1285663906
>
> real 3m1.340s
> user 3m1.290s
> sys 0m0.040s
>
> $ time ./dmatrix
> -1506465222 -119774408 -1600478274 1285663906
>
> real 4m2.109s
> user 4m2.050s
> sys 0m0.050s
>
>
> //////// D version
>
> import core.memory;
>
> import std.stdio, std.string, std.array, std.conv;
>
> const int SIZE = 2000;
>
> int main(string[] args)
> {
> int i, n = args.length > 1 ? to!int(args[1]) : 1;
>
> int[][] m1 = mkmatrix(SIZE,SIZE);
> int[][] m2 = mkmatrix(SIZE,SIZE);
> int[][] mm = mkmatrix(SIZE,SIZE);
>
> GC.disable;
>
> for (i=0; i<n; i++) {
> mmult(m1, m2, mm);
> }
>
> writefln("%d %d %d %d",mm[0][0],mm[2][3],mm[3][2],mm[4][4]);
>
> return 0;
> }
>
> int[][] mkmatrix(int rows, int cols)
> {
> int[][] m;
> int count = 1;
>
> m.length = rows;
> foreach(ref int[] mi; m)
> {
> mi.length = cols;
> foreach(ref int mij; mi)
> {
> mij = count++;
> }
> }
>
> return(m);
> }
>
> void mmult(int[][] m1, int[][] m2, int[][] m3)
> {
> foreach(int i, int[] m1i; m1)
> {
> foreach(int j, ref int m3ij; m3[i])
> {
> int val;
> foreach(int k, int[] m2k; m2)
> {
> val += m1i[k] * m2k[j];
> }
> m3ij = val;
> }
> }
> }
>
> ////// C++ version
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
>
> #define SIZE 2000
>
> int **mkmatrix(int rows, int cols) {
> int i, j, count = 1;
> int **m = (int **) malloc(rows * sizeof(int *));
> for (i=0; i<rows; i++) {
> m[i] = (int *) malloc(cols * sizeof(int));
> for (j=0; j<cols; j++) {
> m[i][j] = count++;
> }
> }
> return(m);
> }
>
> void zeromatrix(int rows, int cols, int **m) {
> int i, j;
> for (i=0; i<rows; i++)
> for (j=0; j<cols; j++)
> m[i][j] = 0;
> }
>
> void freematrix(int rows, int **m) {
> while (--rows > -1) { free(m[rows]); }
> free(m);
> }
>
> int **mmult(int rows, int cols, int **m1, int **m2, int **m3) {
> int i, j, k, val;
> for (i=0; i<rows; i++) {
> for (j=0; j<cols; j++) {
> val = 0;
> for (k=0; k<cols; k++) {
> val += m1[i][k] * m2[k][j];
> }
> m3[i][j] = val;
> }
> }
> return(m3);
> }
>
> int main(int argc, char *argv[]) {
> int i, n = ((argc == 2) ? atoi(argv[1]) : 1);
>
> int **m1 = mkmatrix(SIZE, SIZE);
> int **m2 = mkmatrix(SIZE, SIZE);
> int **mm = mkmatrix(SIZE, SIZE);
>
> for (i=0; i<n; i++) {
> mm = mmult(SIZE, SIZE, m1, m2, mm);
> }
> printf("%d %d %d %d\n", mm[0][0], mm[2][3], mm[3][2],
> mm[4][4]);
>
> freematrix(SIZE, m1);
> freematrix(SIZE, m2);
> freematrix(SIZE, mm);
> return(0);
> }
First things first: you're not just timing the multiplication,
you're timing the memory allocation as well. I suggest using
http://dlang.org/phobos/std_datetime.html#StopWatch to do some
proper timings in D.
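For example, something along these lines (just a sketch, assuming
the mmult and matrices from your post are in scope; peek() returns
a TickDuration, which has an msecs property):

```d
import std.datetime : StopWatch;
import std.stdio;

// Time only the multiplication itself, after allocation is done.
void timedRun(int[][] m1, int[][] m2, int[][] mm)
{
    StopWatch sw;
    sw.start();
    mmult(m1, m2, mm);   // the mmult from the original post
    sw.stop();
    writefln("mmult took %s ms", sw.peek().msecs);
}
```

That way the allocation and GC behaviour drop out of the comparison
entirely.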
Also, there is a semi-documented multi-dimensional array
allocation syntax that is very neat. Here is a simplified version
of mkmatrix using it:
int[][] mkmatrix(size_t rows, size_t cols)
{
    int[][] m = new int[][](rows, cols);
    int count = 1;  // int, not size_t: mij is an int, and size_t
                    // does not implicitly narrow to int
    foreach(ref mi; m)
        foreach(ref mij; mi)
            mij = count++;
    return m;
}
However, I have myself found that D is slower than C for this sort
of intense numerical work. The assembly output should show why
quite easily.
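One concrete thing worth trying before reaching for the assembly:
in both versions the inner loop reads m2[k][j], walking m2
column-wise, which is cache-hostile. The usual fix is the i-k-j
loop order, which makes the inner loop stream through rows
contiguously. A sketch in C++ (using std::vector for brevity
rather than the int** from your post; the same reordering applies
to the D version):

```cpp
#include <cassert>
#include <vector>

// i-k-j matrix multiply: the innermost loop touches m2[k][j] and
// m3[i][j] with stride 1, instead of striding down a column.
void mmult_ikj(int rows, int cols,
               const std::vector<std::vector<int>>& m1,
               const std::vector<std::vector<int>>& m2,
               std::vector<std::vector<int>>& m3)
{
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++)
            m3[i][j] = 0;                    // clear the output row first
        for (int k = 0; k < cols; k++) {
            int m1ik = m1[i][k];             // hoist the invariant factor
            for (int j = 0; j < cols; j++)
                m3[i][j] += m1ik * m2[k][j]; // contiguous in j
        }
    }
}
```

It computes the same product, so the printed sample entries should
not change, but at SIZE = 2000 the memory-access pattern usually
matters far more than the front end being gdc or g++.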