Simple performance question from a newcomer

dextorious via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Sun Feb 21 06:32:15 PST 2016


I've been vaguely aware of D for many years, but the recent 
addition of std.experimental.ndslice finally inspired me to give 
it a try, since my main expertise lies in the domain of 
scientific computing and I primarily use Python/Julia/C++, where 
multidimensional arrays can be handled with a great deal of 
expressiveness and flexibility. Before writing anything serious, 
I wanted to get a sense for the kind of code I would have to 
write to get the best performance for numerical calculations, so 
I wrote a trivial summation benchmark. The following code gave me 
slightly surprising results:

import std.stdio;
import std.array : array;
import std.algorithm;
import std.datetime;
import std.range;
import std.experimental.ndslice;

void main() {
	int N = 1000;
	int Q = 20;
	int times = 1_000;
	double[] res1 = uninitializedArray!(double[])(N);
	double[] res2 = uninitializedArray!(double[])(N);
	double[] res3 = uninitializedArray!(double[])(N);
	auto f = iota(0.0, 1.0, 1.0 / Q / N).sliced(N, Q);
	StopWatch sw;
	double t0, t1, t2;
	sw.start();
	foreach (unused; 0..times) {
		for (int i=0; i<N; ++i) {
			res1[i] = sumtest1(f[i]);
		}
	}
	sw.stop();
	t0 = sw.peek().msecs;
	sw.reset();
	sw.start();
	foreach (unused; 0..times) {
		for (int i=0; i<N; ++i) {
			res2[i] = sumtest2(f[i]);
		}
	}
	sw.stop();
	t1 = sw.peek().msecs;
	sw.reset();
	sw.start();
	foreach (unused; 0..times) {
		sumtest3(f, res3, N, Q);
	}
	t2 = sw.peek().msecs;
	writeln(t0, " ms");
	writeln(t1, " ms");
	writeln(t2, " ms");
	assert( res1 == res2 );
	assert( res2 == res3 );
}

auto sumtest1(Range)(Range range) @safe pure nothrow @nogc {
	return range.sum;
}

auto sumtest2(Range)(Range f) @safe pure nothrow @nogc {
	double retval = 0.0;
	foreach (double f_ ; f)	{
		retval += f_;
	}
	return retval;
}

auto sumtest3(Range)(Range f, double[] retval, double N, double 
Q) @safe pure nothrow @nogc {
	for (int i=0; i<N; ++i)	{
		for (int j=1; j<Q; ++j)	{
			retval[i] += f[i,j];
		}
	}
}

When I compiled it using dmd -release -inline -O -noboundscheck 
../src/main.d, I got the following timings:
1268 ms
312 ms
271 ms

I had heard while reading up on the language that in D explicit 
loops are generally frowned upon and not necessary for the usual 
performance reasons. Nevertheless, the two explicit loop 
functions gave me an improvement by a factor of 4+. Furthermore, 
the difference between sumtest2 and sumtest3 seems to indicate 
that function calls have a significant overhead. I also tried 
using f.reduce!((a, b) => a + b) instead of f.sum in sumtest1, 
but that yielded even worse performance. I did not try the 
GDC/LDC compilers yet, since they don't seem to be up to date on 
the standard library and don't include the ndslice package last I 
checked.

Now, seeing as how my experience writing D is literally a few 
hours, is there anything I did blatantly wrong? Did I miss any 
optimizations? Most importantly, can the elegant operator 
chaining style be generally made as fast as the explicit loops 
we've all been writing for decades?


More information about the Digitalmars-d-learn mailing list