Simple performance question from a newcomer

Sun Feb 21 08:01:10 PST 2016

So I guess pairwise summation is the one to blame here.
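
For context, here is a minimal sketch of how pairwise summation differs from a plain accumulation loop. The names naiveSum and pairwiseSum and the cutoff of 16 are my own illustration, not the actual Phobos implementation. The point is only that the pairwise variant does extra splitting and extra calls per row, which may account for part of the gap on rows as short as 20 elements.

```d
import std.stdio : writeln;

// Plain left-to-right accumulation, like the explicit loop in sumtest2.
double naiveSum(const(double)[] xs) @safe pure nothrow @nogc
{
    double acc = 0.0;
    foreach (x; xs)
        acc += x;
    return acc;
}

// Pairwise (cascade) summation: split the input in half, sum each half
// recursively, then add the two partial sums.
// The cutoff is an illustrative assumption, not the value Phobos uses.
double pairwiseSum(const(double)[] xs) @safe pure nothrow @nogc
{
    enum size_t cutoff = 16;
    if (xs.length <= cutoff)
        return naiveSum(xs);
    immutable mid = xs.length / 2;
    return pairwiseSum(xs[0 .. mid]) + pairwiseSum(xs[mid .. $]);
}

void main()
{
    auto xs = new double[](1000);
    xs[] = 0.1; // fill every element with the same value
    writeln(naiveSum(xs));
    writeln(pairwiseSum(xs));
}
```

Pairwise summation trades some speed for a smaller rounding error on long inputs, so the two printed values may differ in the last digits.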

On 21.2.2016 at 16:56, Daniel Kozak wrote:
> You can use -profile to see what is causing it.
>
>   Num          Tree        Func        Per
>   Calls        Time        Time        Call
>
> 23000000   550799875   550243765          23
>     pure nothrow @nogc @safe double
>     std.algorithm.iteration.sumPairwise!(double,
>     std.experimental.ndslice.slice.Slice!(1uL,
>     std.range.iota!(double, double, double).iota(double, double, double).Result).Slice)
>     .sumPairwise(std.experimental.ndslice.slice.Slice!(1uL,
>     std.range.iota!(double, double, double).iota(double, double, double).Result).Slice)
>
> On 21.2.2016 at 15:32, dextorious via Digitalmars-d-learn wrote:
>> I've been vaguely aware of D for many years, but the recent addition 
>> of std.experimental.ndslice finally inspired me to give it a try, 
>> since my main expertise lies in the domain of scientific computing 
>> and I primarily use Python/Julia/C++, where multidimensional arrays 
>> can be handled with a great deal of expressiveness and flexibility. 
>> Before writing anything serious, I wanted to get a sense for the kind 
>> of code I would have to write to get the best performance for 
>> numerical calculations, so I wrote a trivial summation benchmark. The 
>> following code gave me slightly surprising results:
>>
>> import std.stdio;
>> import std.array : uninitializedArray;
>> import std.algorithm;
>> import std.datetime;
>> import std.range;
>> import std.experimental.ndslice;
>>
>> void main() {
>>     int N = 1000;
>>     int Q = 20;
>>     int times = 1_000;
>>     double[] res1 = uninitializedArray!(double[])(N);
>>     double[] res2 = uninitializedArray!(double[])(N);
>>     double[] res3 = uninitializedArray!(double[])(N);
>>     auto f = iota(0.0, 1.0, 1.0 / Q / N).sliced(N, Q);
>>     StopWatch sw;
>>     double t0, t1, t2;
>>     sw.start();
>>     foreach (unused; 0..times) {
>>         for (int i=0; i<N; ++i) {
>>             res1[i] = sumtest1(f[i]);
>>         }
>>     }
>>     sw.stop();
>>     t0 = sw.peek().msecs;
>>     sw.reset();
>>     sw.start();
>>     foreach (unused; 0..times) {
>>         for (int i=0; i<N; ++i) {
>>             res2[i] = sumtest2(f[i]);
>>         }
>>     }
>>     sw.stop();
>>     t1 = sw.peek().msecs;
>>     sw.reset();
>>     sw.start();
>>     foreach (unused; 0..times) {
>>         sumtest3(f, res3, N, Q);
>>     }
>>     sw.stop();
>>     t2 = sw.peek().msecs;
>>     writeln(t0, " ms");
>>     writeln(t1, " ms");
>>     writeln(t2, " ms");
>>     assert( res1 == res2 );
>>     assert( res2 == res3 );
>> }
>>
>> auto sumtest1(Range)(Range range) @safe pure nothrow @nogc {
>>     return range.sum;
>> }
>>
>> auto sumtest2(Range)(Range f) @safe pure nothrow @nogc {
>>     double retval = 0.0;
>>     foreach (double f_ ; f)    {
>>         retval += f_;
>>     }
>>     return retval;
>> }
>>
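>> void sumtest3(Range)(Range f, double[] retval, int N, int Q)
>> @safe pure nothrow @nogc {
>>     for (int i=0; i<N; ++i)    {
>>         // Reset first: the array starts uninitialized and this
>>         // function is called repeatedly on the same buffer.
>>         retval[i] = 0.0;
>>         // Start at 0 so the first column is included, as in sumtest1/2.
>>         for (int j=0; j<Q; ++j)    {
>>             retval[i] += f[i,j];
>>         }
>>     }
>> }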
>>
>> When I compiled it using dmd -release -inline -O -noboundscheck 
>> ../src/main.d, I got the following timings:
>> 1268 ms
>> 312 ms
>> 271 ms
>>
>> I had heard while reading up on the language that in D explicit loops 
>> are generally frowned upon and not necessary for the usual 
>> performance reasons. Nevertheless, the two explicit loop functions 
>> gave me an improvement by a factor of 4+. Furthermore, the difference 
>> between sumtest2 and sumtest3 seems to indicate that function calls 
>> have a significant overhead. I also tried using f.reduce!((a, b) => a 
>> + b) instead of f.sum in sumtest1, but that yielded even worse 
>> performance. I did not try the GDC/LDC compilers yet, since they 
>> don't seem to be up to date on the standard library and don't include 
>> the ndslice package last I checked.
>>
>> Now, seeing as how my experience writing D is literally a few hours, 
>> is there anything I did blatantly wrong? Did I miss any 
>> optimizations? Most importantly, can the elegant operator chaining 
>> style be generally made as fast as the explicit loops we've all been 
>> writing for decades?
>
