Algorithms & opApply

Fri Aug 27 06:45:07 PDT 2010

== Quote from Andrei Alexandrescu (SeeWebsiteForEmail at erdani.org)'s article
> On 8/23/10 19:26 PDT, dsimcha wrote:
> > == Quote from bearophile (bearophileHUGS at lycos.com)'s article
> >> The gentle and a-lot-working David Simcha has converted one of my bug reports
> > into an enhancement request:
> >> http://d.puremagic.com/issues/show_bug.cgi?id=4264
> >> Currently (SVN) only array() is able to digest iterables that define opApply
> > only, but I think a few more spread support will be very good.
> >> In the thread of that 4264 there are plenty of comments and views of the
> > situation. Opinions welcome, both here and there.
> >> Bye,
> >> bearophile
> >
> > This one didn't get much attention, possibly because the bug report is too long,
> > too complicated and too indirect.  Let me reiterate the basic points inline.
> >
> > I've been working heavily on debugging and polishing std.range and std.algorithm.
> >   (I've fixed 14 Bugzilla bugs plus several unlisted bugs and enhancements in
these
> > modules since last DMD release.)  One issue that's been nagging me for a while,
> > and has apparently been nagging Bearophile, too is that opApply-based objects are
> > not supported at all, even where supporting them would be completely feasible.
> > Examples are std.algorithm.map, std.algorithm.filter and std.algorithm.chain.  I'm
> > thinking of generalizing std.range and std.algorithm to work with any foreach-able
> > object where possible, including opApply-based iteration.  My concerns are:
> >
> > 1.  Bug 2443 makes it impossible to propagate ref to higher order objects
correctly.
> >
> > 2.  Building higher order iterables with opApply might get so slow that it's
> > virtually unusable, due to multiple layers of delegate calls.  Then again, this
> > might be premature optimization for a dumb compiler, as LDC can inline through at
> > least one level of opApply delegate calls.  I haven't tested to see if it can do
> > more than this.
> >
> > 3.  Ranges are the flagship way of iterating through an object in D, as they
> > should be.  opApply has its uses, but they're relatively few and far between.  I'm
> > wondering if anyone besides bearophile and I think supporting opApply is worth the
> > effort and potential bloat, if Walter cares enough to fix Bug 2443, and if Andrei
> > considers it to be a reasonable part of std.range/std.algorithm's design.
> Thanks for this note.
> First, I think those interested in the relative merits and demerits of
> ranges vs. opApply should read this:
> http://okmij.org/ftp/papers/LL3-collections-enumerators.txt. The author,
> Oleg Kiselyov - an excellent PL theorist - makes a lot of excellent
> points (but also misses a few; I plan to write a rebuttal to that
> article a some point).
> Clearly opApply (and internal iteration in general) has many merits that
> are difficult to rival with ranges. The converse is also true. So
> perhaps it's compelling to have both.
> One concern regarding opApply is speed of manipulation through an
> indirect (delegate) call. It would be great to have some benchmarks that
> give us a good baseline for discussion. David, would you want to put a
> few together?
> Andrei

Here's one benchmark and the results, using SHOO's awesome new StopWatch module
that just landed in SVN a few days ago:

import std.stdio, std.functional, std.stopwatch, std.traits, std.range,
    std.algorithm;

struct OpApplyFilter(alias pred, Range) {
    alias unaryFun!pred fun;
    Range range;

    int opApply(int delegate(ref ForeachType!Range) dg) {
        int res;

        foreach(elem; range) if(fun(elem)) {
            res = dg(elem);
            if(res) break;
        }

        return res;
    }
}

auto opApplyFilter(alias pred, Range)(Range range) {
    return OpApplyFilter!(pred, Range)(range);
}

struct OpApplyMap(alias pred, Range) {
    alias unaryFun!pred fun;
    Range range;

    int opApply(int delegate(ref ForeachType!Range) dg) {
        int res;

        foreach(elem; range) {
            auto mapped = fun(elem);
            res = dg(mapped);
            if(res) break;
        }

        return res;
    }
}

auto opApplyMap(alias pred, Range)(Range range) {
    return OpApplyMap!(pred, Range)(range);
}

void main() {
    auto myRange = iota(100_000_000);

    // Keep the compiler from excessively optimizing.
    int sum;

    auto sw = new StopWatch(StopWatch.AutoStart.yes);

    foreach(elem; map!"a * a"(filter!"a & 1"(myRange))) {
        sum += elem;
    }

    sw.stop;
    writeln("Range-based:  ", sw.peek.milliseconds);

    sw.reset();
    sw.start();

    foreach(elem; opApplyMap!"a * a"(opApplyFilter!"a & 1"(myRange))) {
        sum += elem;
    }

    sw.stop;
    writeln("opApply-based:  ", sw.peek.milliseconds);

    sw.reset();
    sw.start();

    foreach(elem; opApplyMap!"a * a"(myRange)) {
        sum += elem;
    }

    sw.stop;
    writeln("opApply-based, Map only:  ", sw.peek.milliseconds);

    sw.reset();
    sw.start();

    foreach(elem; opApplyFilter!"a & 1"(
    opApplyMap!"a + 1"(opApplyMap!"a * a"(myRange)))) {
        sum += elem;
    }

    sw.stop;
    writeln("opApply-based, two maps:  ", sw.peek.milliseconds);

    // This is to force the value of sum to be used to prevent
    // overly aggressive optimizations by DMD.
    writeln("Ignore this:  ", sum);
}

Results:

Range-based:  404.445
opApply-based:  718.914
opApply-based, Map only:  524.051
opApply-based, two maps:  1501.92
Ignore this:  1158518272

So it seems like adding more and more layers of stacking causes each layer of
stacking to have progressively more overhead.  This makes sense because each layer
of stacking is likely to have a progressively worse effect on the CPU pipeline.
The CPU can probably speculatively execute around one layer of indirect calls, but
not a whole bunch.

BTW, from looking at disassemblies it seems like for some strange reason the
predicate functions are never getting inlined.