Modern C++ Lamentations

Tue Mar 5 21:13:38 UTC 2019

On Friday, 4 January 2019 at 16:21:40 UTC, Steven Schveighoffer 
wrote:
> On 1/4/19 9:00 AM, Dukc wrote:
>>>
>> 
>> Isn't the main problem with performance of the Timon's range 
>> loop that it uses arbitrary-sized integers (BigInts)?
>
> Atila's version of that code doesn't use bigints: 
> https://github.com/atilaneves/pythagoras/blob/master/range.d#L24
>
> The major problem with the D range implementation is that the 
> compiler isn't able to find the optimization of hoisting the 
> multiplication of the outer indexes out of the inner loop.
>
> See my responses to Atila in this thread.
>
> -Steve

Now, I'm replying to an old theard, I hope you're still 
interested enough to warrant necrobumping. The thing is, I did do 
some additional testing of the range version, and I think I found 
out a way to make the compiler to find the quoted optimization 
without doing it manually.

You just have to move the sqr calculations to where the data is 
still nested:

import std.experimental.all;
import std.datetime.stopwatch : AutoStart, StopWatch;

alias then(alias a)=(r)=>map!a(r).joiner;
void main(){
     auto sw = StopWatch(AutoStart.no);
     int total;

     if (true)
     {   sw.start;
         scope (success) sw.stop;
         auto triples=recurrence!"a[n-1]+1"(1L)

.then!(z=>iota(1,z+1).then!(x=>iota(x,z+1).map!(y=>tuple(x,y,z))))
             .filter!((t)=>t[0]^^2+t[1]^^2==t[2]^^2)
             .until!(t=>t[2] >= 500);
         triples.each!((x,y,z){ total += x+y+z; });
     }

     writefln("Old loop time is %s microseconds", 
sw.peek.total!"usecs"); // 118_614
     sw.reset;

     if (true)
     {   sw.start;
         scope (success) sw.stop;
         auto triples=recurrence!"a[n-1]+1"(1L)
             .then!(z=>iota(1,z+1).then!
             (   x=>iota(x,z+1)
                 .map!(y=> tuple(x,y,z))
                 .filter!((t)=>t[0]^^2+t[1]^^2==t[2]^^2))
             )
             .until!(t=>t[2] >= 500);
         triples.each!((x,y,z){ total += x+y+z; });
     }

     writefln("New loop time is %s microseconds", 
sw.peek.total!"usecs"); // 21_936
     writeln(total); // to force the compiler to do the 
calculations
     return;
}

See, no manual caching of the squares. And the improvement is 
over 5X (dub --build=release --compiler=ldc2), which should bring 
it close to the other versions in the blog entry.