Efficient idiom for fastest code

Wed May 23 10:55:02 UTC 2018

On Wednesday, 23 May 2018 at 02:24:08 UTC, IntegratedDimensions 
wrote:
> In some cases the decision holds for continuous ranges. For 
> some 0 <= n <= N the decision is constant, but n is 
> arbitrary(determined by unknown factors at compile time).
>
> One can speed up the routine by using something akin to a 
> simplified strategy pattern where one uses 
> functions/delegates/lambdas to code a faster version without 
> the test:
>
>
> for(int i = 0; i < N; i++)
> {
>     d = (){ if(decision(i)) A; else d = () { B; };
>     d();
> }
>
> this code basically reduces to
>
> for(int i = 0; i < N; i++)
> {
>     B;
> }
>
> Once the decision fails, which we assume once it fails it 
> always fails in this particular case.
>
> Therefor, once the decision fails it kicks in the "faster" 
> code. Suppose decision is very slow.

I would just do
>int i=0;
>for(;decision(i) && i < N; i++)
>{
>    A;
>}
>for(;i < N; i++)
>{
>    B;
>}

This could be turned to a mixin template with something like this:
>mixin template forSplit(alias condition, alias A, alias B)
>{
>    void execute()
>    {
>        int i = 0;
>        for (; condition(i) && i < N; i++)
>        {
>            A();
>        }
>        for (; i < N; i++)
>        {
>            B();
>        }
>    }
>}

and to use it in code (assuming N is defined in the scope)
>mixin forSplit!((int i)=>(decision(i)), {A;}, {B;}) loop;
>loop.execute();

I have't measured anything, but I would assume that delegates 
come with an overhead that you just don't need here. In fact when 
trying to use
> auto d = (int i) {};
> d = (int i){ if(decision(i)) A; else d = (int i) { B; };};
> for(int i = 0; i < N; i++)
> {
>     d(i);
> }
All I got was "cannot access frame of function D main" which sums 
up my experiences with lambdas in D so far.

While PGO and branch predictor are good, they don't help much 
here. Not executing an expression is out of scope for them. All 
they do is prevent pipeline flushes.
Also I think you overestimate what the compiler can do. My 
decision function to do some testing was this:
>bool decision(int a) pure
>out (result) {
>    	assert(result == (a < 10));
>} do {
>    import std.algorithm, std.range;
> 
>    // stolen from https://dlang.org/library/std/parallelism.html
>    enum n = 1_000_000;
>    enum delta = 1.0 / n;
>    alias getTerm = (int i)
>    {
>        immutable x = ( i - 0.5 ) * delta;
>        return delta / ( 1.0 + x * x ) ;
>    };
>    immutable pi = 4.0 * reduce!"a + b"(n.iota.map!getTerm);
> 
>    return a < 3*pi;
>}
With N=100 I got a speedup of ~10 (ldc -O3 -release). Even though 
this function is pure and could be optimized a lot. It calculated 
pi for every single call. And optimizing the decision function 
isn't even the point of that question.