Simple performance question from a newcomer

Sun Feb 21 06:32:15 PST 2016

I've been vaguely aware of D for many years, but the recent 
addition of std.experimental.ndslice finally inspired me to give 
it a try, since my main expertise lies in the domain of 
scientific computing and I primarily use Python/Julia/C++, where 
multidimensional arrays can be handled with a great deal of 
expressiveness and flexibility. Before writing anything serious, 
I wanted to get a sense for the kind of code I would have to 
write to get the best performance for numerical calculations, so 
I wrote a trivial summation benchmark. The following code gave me 
slightly surprising results:

import std.stdio;
import std.array : array;
import std.algorithm;
import std.datetime;
import std.range;
import std.experimental.ndslice;

void main() {
	int N = 1000;
	int Q = 20;
	int times = 1_000;
	double[] res1 = uninitializedArray!(double[])(N);
	double[] res2 = uninitializedArray!(double[])(N);
	double[] res3 = uninitializedArray!(double[])(N);
	auto f = iota(0.0, 1.0, 1.0 / Q / N).sliced(N, Q);
	StopWatch sw;
	double t0, t1, t2;
	sw.start();
	foreach (unused; 0..times) {
		for (int i=0; i<N; ++i) {
			res1[i] = sumtest1(f[i]);
		}
	}
	sw.stop();
	t0 = sw.peek().msecs;
	sw.reset();
	sw.start();
	foreach (unused; 0..times) {
		for (int i=0; i<N; ++i) {
			res2[i] = sumtest2(f[i]);
		}
	}
	sw.stop();
	t1 = sw.peek().msecs;
	sw.reset();
	sw.start();
	foreach (unused; 0..times) {
		sumtest3(f, res3, N, Q);
	}
	t2 = sw.peek().msecs;
	writeln(t0, " ms");
	writeln(t1, " ms");
	writeln(t2, " ms");
	assert( res1 == res2 );
	assert( res2 == res3 );
}

auto sumtest1(Range)(Range range) @safe pure nothrow @nogc {
	return range.sum;
}

auto sumtest2(Range)(Range f) @safe pure nothrow @nogc {
	double retval = 0.0;
	foreach (double f_ ; f)	{
		retval += f_;
	}
	return retval;
}

auto sumtest3(Range)(Range f, double[] retval, double N, double 
Q) @safe pure nothrow @nogc {
	for (int i=0; i<N; ++i)	{
		for (int j=1; j<Q; ++j)	{
			retval[i] += f[i,j];
		}
	}
}

When I compiled it using dmd -release -inline -O -noboundscheck 
../src/main.d, I got the following timings:
1268 ms
312 ms
271 ms

I had heard while reading up on the language that in D explicit 
loops are generally frowned upon and not necessary for the usual 
performance reasons. Nevertheless, the two explicit loop 
functions gave me an improvement by a factor of 4+. Furthermore, 
the difference between sumtest2 and sumtest3 seems to indicate 
that function calls have a significant overhead. I also tried 
using f.reduce!((a, b) => a + b) instead of f.sum in sumtest1, 
but that yielded even worse performance. I did not try the 
GDC/LDC compilers yet, since they don't seem to be up to date on 
the standard library and don't include the ndslice package last I 
checked.

Now, seeing as how my experience writing D is literally a few 
hours, is there anything I did blatantly wrong? Did I miss any 
optimizations? Most importantly, can the elegant operator 
chaining style be generally made as fast as the explicit loops 
we've all been writing for decades?