Efficient idiom for fastest code
Malte
no at valid.mail
Wed May 23 10:55:02 UTC 2018
On Wednesday, 23 May 2018 at 02:24:08 UTC, IntegratedDimensions
wrote:
> In some cases the decision holds for continuous ranges. For
> some 0 <= n <= N the decision is constant, but n is
> arbitrary(determined by unknown factors at compile time).
>
> One can speed up the routine by using something akin to a
> simplified strategy pattern where one uses
> functions/delegates/lambdas to code a faster version without
> the test:
>
>
> for(int i = 0; i < N; i++)
> {
> d = (){ if(decision(i)) A; else d = () { B; };
> d();
> }
>
> this code basically reduces to
>
> for(int i = 0; i < N; i++)
> {
> B;
> }
>
> Once the decision fails, which we assume once it fails it
> always fails in this particular case.
>
> Therefor, once the decision fails it kicks in the "faster"
> code. Suppose decision is very slow.
I would just do
>int i=0;
>for(;decision(i) && i < N; i++)
>{
> A;
>}
>for(;i < N; i++)
>{
> B;
>}
This could be turned to a mixin template with something like this:
>mixin template forSplit(alias condition, alias A, alias B)
>{
> void execute()
> {
> int i = 0;
> for (; condition(i) && i < N; i++)
> {
> A();
> }
> for (; i < N; i++)
> {
> B();
> }
> }
>}
and to use it in code (assuming N is defined in the scope)
>mixin forSplit!((int i)=>(decision(i)), {A;}, {B;}) loop;
>loop.execute();
I have't measured anything, but I would assume that delegates
come with an overhead that you just don't need here. In fact when
trying to use
> auto d = (int i) {};
> d = (int i){ if(decision(i)) A; else d = (int i) { B; };};
> for(int i = 0; i < N; i++)
> {
> d(i);
> }
All I got was "cannot access frame of function D main" which sums
up my experiences with lambdas in D so far.
While PGO and branch predictor are good, they don't help much
here. Not executing an expression is out of scope for them. All
they do is prevent pipeline flushes.
Also I think you overestimate what the compiler can do. My
decision function to do some testing was this:
>bool decision(int a) pure
>out (result) {
> assert(result == (a < 10));
>} do {
> import std.algorithm, std.range;
>
> // stolen from https://dlang.org/library/std/parallelism.html
> enum n = 1_000_000;
> enum delta = 1.0 / n;
> alias getTerm = (int i)
> {
> immutable x = ( i - 0.5 ) * delta;
> return delta / ( 1.0 + x * x ) ;
> };
> immutable pi = 4.0 * reduce!"a + b"(n.iota.map!getTerm);
>
> return a < 3*pi;
>}
With N=100 I got a speedup of ~10 (ldc -O3 -release). Even though
this function is pure and could be optimized a lot. It calculated
pi for every single call. And optimizing the decision function
isn't even the point of that question.
More information about the Digitalmars-d-learn
mailing list