Yet another strike against the current AA implementation

Sun Apr 26 21:50:23 PDT 2009

== Quote from Christopher Wright (dhasenan at gmail.com)'s article
> Andrei Alexandrescu wrote:
> > 2. I haven't measured, but the cost of the indirect call is large enough
> > to make me suspect that opApply is not as efficient as it's cracked to
> > be, even when compared with an iterator.
> When you know the type beforehand or can use templates, that is, rather
> than wrapping your range struct in a wrapper class. If you can't use a
> template for whatever reason, ranges are going to suck -- three virtual
> calls rather than one.
> I don't usually care sufficiently about performance to worry about
> whether a call is virtual or not, but you brought that issue up before.
> And I imagine that, most of the time, you will know the range type in
> advance.

Rule number 1 of all performance discussions:  Always measure if possible.

import std.stdio, std.perf;

struct Direct {
    uint n;

    uint front() {
        return n;
    }

    void popFront() {
        n++;
    }

    bool empty() {
        return n >= 1_000_000_000;
    }
}

struct Apply {

    int opApply(int delegate(ref uint) dg) {
        int res = 0;
        foreach(i; 0U..1_000_000_000) {
            res = dg(i);
            if(res) break;
        }
        return res;
    }
}

class Virtual {
    uint n;

    uint front() {
        return n;
    }

    void popFront() {
        n++;
    }

    bool empty() {
        return n >= 1_000_000_000;
    }
}

void main() {
    scope pc = new PerformanceCounter;
    pc.start;
    foreach(elem; Direct.init) {}
    pc.stop;
    writeln("Direct:  ", pc.milliseconds);
    pc.start;
    auto v = new Virtual;
    foreach(elem; v) {}
    pc.stop;
    writeln("Virtual:  ", pc.milliseconds);
    pc.start;
    foreach(elem; Apply.init) {}
    pc.stop;
    writeln("opApply:  ", pc.milliseconds);
}

Output:
Direct:  2343
Virtual:  5695
opApply:  3014

Bottom line is that these pretty much come in the order you would expect them to,
but there are no particularly drastic differences between any of them.  To put
these timings in perspective, 5700 ms for 1 billion iterations is roughly (on a
2.7 GHz machine) 15 clock cycles per iteration.  How often does anyone really have
code that is performance critical *and* where the contents of the loop don't take
long enough to dwarf the 15 clock cycles per iteration loop overhead *and* you
need the iteration to be polymorphic?