Simple performance question from a newcomer

Daniel Kozak via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Sun Feb 21 07:56:40 PST 2016


You can use -profile to see what is causing it.

   Num          Tree        Func        Per
   Calls        Time        Time        Call

23000000   550799875   550243765          23     pure nothrow @nogc @safe double std.algorithm.iteration.sumPairwise!(double, std.experimental.ndslice.slice.Slice!(1uL, std.range.iota!(double, double, double).iota(double, double, double).Result).Slice).sumPairwise(std.experimental.ndslice.slice.Slice!(1uL, std.range.iota!(double, double, double).iota(double, double, double).Result).Slice)
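For reference, a minimal sketch of how such a trace is produced (the source file name and working directory are assumptions; adjust to your layout):

```shell
# Rebuild with instrumentation: -profile makes dmd insert timing hooks
# into every function
dmd -profile -O -release -inline main.d

# Running the instrumented binary writes trace.log (and trace.def)
# into the current working directory
./main

# Inspect trace.log: the "Func Time" column shows where the time is
# actually spent; here virtually all of it lands in sumPairwise
```

The trace points at std.algorithm.iteration.sumPairwise, the pairwise summation algorithm that sum selects for random-access floating-point ranges; it trades raw speed for better floating-point accuracy compared to a plain accumulation loop.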

Dne 21.2.2016 v 15:32 dextorious via Digitalmars-d-learn napsal(a):
> I've been vaguely aware of D for many years, but the recent addition 
> of std.experimental.ndslice finally inspired me to give it a try, 
> since my main expertise lies in the domain of scientific computing and 
> I primarily use Python/Julia/C++, where multidimensional arrays can be 
> handled with a great deal of expressiveness and flexibility. Before 
> writing anything serious, I wanted to get a sense for the kind of code 
> I would have to write to get the best performance for numerical 
> calculations, so I wrote a trivial summation benchmark. The following 
> code gave me slightly surprising results:
>
> import std.stdio;
> import std.array : array;
> import std.algorithm;
> import std.datetime;
> import std.range;
> import std.experimental.ndslice;
>
> void main() {
>     int N = 1000;
>     int Q = 20;
>     int times = 1_000;
>     double[] res1 = uninitializedArray!(double[])(N);
>     double[] res2 = uninitializedArray!(double[])(N);
>     double[] res3 = uninitializedArray!(double[])(N);
>     auto f = iota(0.0, 1.0, 1.0 / Q / N).sliced(N, Q);
>     StopWatch sw;
>     double t0, t1, t2;
>     sw.start();
>     foreach (unused; 0..times) {
>         for (int i=0; i<N; ++i) {
>             res1[i] = sumtest1(f[i]);
>         }
>     }
>     sw.stop();
>     t0 = sw.peek().msecs;
>     sw.reset();
>     sw.start();
>     foreach (unused; 0..times) {
>         for (int i=0; i<N; ++i) {
>             res2[i] = sumtest2(f[i]);
>         }
>     }
>     sw.stop();
>     t1 = sw.peek().msecs;
>     sw.reset();
>     sw.start();
>     foreach (unused; 0..times) {
>         sumtest3(f, res3, N, Q);
>     }
>     sw.stop();
>     t2 = sw.peek().msecs;
>     writeln(t0, " ms");
>     writeln(t1, " ms");
>     writeln(t2, " ms");
>     assert( res1 == res2 );
>     assert( res2 == res3 );
> }
>
> auto sumtest1(Range)(Range range) @safe pure nothrow @nogc {
>     return range.sum;
> }
>
> auto sumtest2(Range)(Range f) @safe pure nothrow @nogc {
>     double retval = 0.0;
>     foreach (double f_ ; f)    {
>         retval += f_;
>     }
>     return retval;
> }
>
> auto sumtest3(Range)(Range f, double[] retval, int N, int Q) 
> @safe pure nothrow @nogc {
>     for (int i=0; i<N; ++i)    {
>         retval[i] = f[i, 0];
>         for (int j=1; j<Q; ++j)    {
>             retval[i] += f[i,j];
>         }
>     }
> }
>
> When I compiled it using dmd -release -inline -O -noboundscheck 
> ../src/main.d, I got the following timings:
> 1268 ms
> 312 ms
> 271 ms
>
> I had heard while reading up on the language that explicit loops are 
> generally frowned upon in D and usually unnecessary for performance. 
> Nevertheless, the two explicit-loop functions gave me an improvement 
> by a factor of 4+. Furthermore, the difference between sumtest2 and 
> sumtest3 seems to indicate that function calls carry significant 
> overhead. I also tried using f.reduce!((a, b) => a + b) 
> instead of f.sum in sumtest1, but that yielded even worse performance. 
> I did not try the GDC/LDC compilers yet, since they don't seem to be 
> up to date on the standard library and don't include the ndslice 
> package last I checked.
>
> Now, seeing as how my experience writing D is literally a few hours, 
> is there anything I did blatantly wrong? Did I miss any optimizations? 
> Most importantly, can the elegant operator chaining style be generally 
> made as fast as the explicit loops we've all been writing for decades?


