Simple performance question from a newcomer

Sun Feb 21 08:29:26 PST 2016

On Sunday, 21 February 2016 at 14:32:15 UTC, dextorious wrote:
> I've been vaguely aware of D for many years, but the recent 
> addition of std.experimental.ndslice finally inspired me to 
> give it a try, since my main expertise lies in the domain of 
> scientific computing and I primarily use Python/Julia/C++, 
> where multidimensional arrays can be handled with a great deal 
> of expressiveness and flexibility. Before writing anything 
> serious, I wanted to get a sense for the kind of code I would 
> have to write to get the best performance for numerical 
> calculations, so I wrote a trivial summation benchmark. The 
> following code gave me slightly surprising results:
>
> import std.stdio;
> import std.array : array;
> import std.algorithm;
> import std.datetime;
> import std.range;
> import std.experimental.ndslice;
>
> void main() {
> 	int N = 1000;
> 	int Q = 20;
> 	int times = 1_000;
> 	double[] res1 = uninitializedArray!(double[])(N);
> 	double[] res2 = uninitializedArray!(double[])(N);
> 	double[] res3 = uninitializedArray!(double[])(N);
> 	auto f = iota(0.0, 1.0, 1.0 / Q / N).sliced(N, Q);
> 	StopWatch sw;
> 	double t0, t1, t2;
> 	sw.start();
> 	foreach (unused; 0..times) {
> 		for (int i=0; i<N; ++i) {
> 			res1[i] = sumtest1(f[i]);
> 		}
> 	}
> 	sw.stop();
> 	t0 = sw.peek().msecs;
> 	sw.reset();
> 	sw.start();
> 	foreach (unused; 0..times) {
> 		for (int i=0; i<N; ++i) {
> 			res2[i] = sumtest2(f[i]);
> 		}
> 	}
> 	sw.stop();
> 	t1 = sw.peek().msecs;
> 	sw.reset();
> 	sw.start();
> 	foreach (unused; 0..times) {
> 		sumtest3(f, res3, N, Q);
> 	}
> 	t2 = sw.peek().msecs;
> 	writeln(t0, " ms");
> 	writeln(t1, " ms");
> 	writeln(t2, " ms");
> 	assert( res1 == res2 );
> 	assert( res2 == res3 );
> }
>
> auto sumtest1(Range)(Range range) @safe pure nothrow @nogc {
> 	return range.sum;
> }
>
> auto sumtest2(Range)(Range f) @safe pure nothrow @nogc {
> 	double retval = 0.0;
> 	foreach (double f_ ; f)	{
> 		retval += f_;
> 	}
> 	return retval;
> }
>
> auto sumtest3(Range)(Range f, double[] retval, double N, double 
> Q) @safe pure nothrow @nogc {
> 	for (int i=0; i<N; ++i)	{
> 		for (int j=1; j<Q; ++j)	{
> 			retval[i] += f[i,j];
> 		}
> 	}
> }
>
> When I compiled it using dmd -release -inline -O -noboundscheck 
> ../src/main.d, I got the following timings:
> 1268 ms
> 312 ms
> 271 ms
>
> I had heard while reading up on the language that in D explicit 
> loops are generally frowned upon and not necessary for the 
> usual performance reasons. Nevertheless, the two explicit loop 
> functions gave me an improvement by a factor of 4+. 
> Furthermore, the difference between sumtest2 and sumtest3 seems 
> to indicate that function calls have a significant overhead. I 
> also tried using f.reduce!((a, b) => a + b) instead of f.sum in 
> sumtest1, but that yielded even worse performance. I did not 
> try the GDC/LDC compilers yet, since they don't seem to be up 
> to date on the standard library and don't include the ndslice 
> package last I checked.
>
> Now, seeing as how my experience writing D is literally a few 
> hours, is there anything I did blatantly wrong? Did I miss any 
> optimizations? Most importantly, can the elegant operator 
> chaining style be generally made as fast as the explicit loops 
> we've all been writing for decades?

The problem is not with ranges, but with the particualr algorithm 
used for summing. If you look at the docs 
(http://dlang.org/phobos-prerelease/std_algorithm_iteration.html#.sum) you'll see that if the range has random-access `sum` will use the pair-wise algorithm. About the second and third tests, the problem is with DMD which should not be used when measuring performance (but only for development, because it has fast compile-times).

These are the results that I get with LDC:
Pair-wise (sumtest1):
415 ms
21 ms
20 ms

And if I use the Kahan algorithm:
106 ms
36 ms
31 ms
The second two results are probably larger due to noise.

And if I increase N to 100_000:
Pair-wise (sumtest1):
29557 ms
2061 ms
1990 ms

Kahan:
4566 ms
2067 ms
1990 ms

According to `dub --verbose`, my command-line was roughly this:
ldc2 -ofapp -release -O5 -singleobj -w source/app.d
../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/internal.d
../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/iteration.d
../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/package.d
../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/selection.d
../../../../.dub/packages/mir-0.10.1-alpha/source/mir/ndslice/slice.d