protocol for using InputRanges
monarch_dodra
monarchdodra at gmail.com
Tue Mar 25 13:56:46 PDT 2014
On Tuesday, 25 March 2014 at 20:15:32 UTC, Walter Bright wrote:
> It's pretty clear that:
>
> 1. the protocol is COMPLETELY undocumented and undefined.
http://dlang.org/phobos/std_range.html#isInputRange
The semantics of an input range (not checkable during
compilation) are assumed to be the following (r is an object of
type R):
r.empty returns false iff there is more data available in the
range.
r.front returns the current element in the range. It may return
by value or by reference. Calling r.front is allowed only if
calling r.empty has, or would have, returned false.
r.popFront advances to the next element in the range. Calling
r.popFront is allowed only if calling r.empty has, or would have,
returned false.
> We want to appeal to the high performance coders. To maximize
> performance, ranges should be optimized to the inner loop case,
> which is:
>
> while (!r.empty) { auto e = r.front; ... e ...;
> r.popFront(); }
This makes the assumption that r.front is copy constructible at
all. It also makes the assumption that you want to operate on a
copy, rather than the actual element in the range.
Finally, it means having to declare a local object: It merely
means shifting the burden of caching from one context to another.
If the object is large, chances are you are better off just
calling front instead of making a copy. Especially if the loop is
trivial.
If you want high performance, then arguably, just provide a O(1)
front, and a O(1) empty.
Also, certain ranges, such as "filter" *must* access the front of
the previous range more than once. Unless you are suggesting we
add a field for it to cache the result of the previous range?
> With the additional proviso that ranges are passed to
> algorithms by value, so they should be cheap to copy. Cheap to
> copy usually means them being small.
Unfortunately yes. That said, any range that does anything will
have at least two fields, one of which is a slice, or comparable
to in terms of size, so it's going to be big anyways. So you
*are* better off passing by ref if you can regardless, unless
your range is *really* trivial.
I agree that range sizes can be a problem.
> A) I know that my range is not empty, so I can skip calling
> empty.
>
> Since front is guaranteed to succeed if !empty, this puts a
> requirement on many ranges that they have a non-trivial
> constructor that 'primes the pump'. Of course, priming may
> fail, and so construction may throw, which is not good design.
If the prime fails, then the range can simply be marked as empty.
Then if you decide to skip calling empty anyways, it's your own
fault.
> And lastly, it means that lazy ranges will be required to read
> the first element, even if the range isn't then subsequently
> used, which defeats what one would expect from a lazy range.
I'm not yet convinced of adding special code for ranges that
aren't used. I've heard of these kinds of ranges, but I've
observed that when you declare one, it almost always ends up
being used. I don't think we should waste efforts on this rare
usecase.
As for "Lazy", in range terms, it mostly only means you calculate
things element at once, as you go up the range chain. As opposed
to processing the entire input data, one transformation at a time.
Evaluating an element "on use" as opposed to "1 instruction
before use" doesn't make much of a change in this context.
> All this saves for the user is one call to empty for the entire
> algorithm, at a cost incurred with every iteration. I.e. it
> selects O(n) to save O(1).
If code that was actually meant to *do* something was in
popFront() to begin with, then there'd be no extra overhead.
I've found that if you are creative enough, you can usually
design the range in such a way that it works efficiently,
lazilly, and without flags.
> Hence, I propose that the protocol for using input ranges be
> defined as:
>
> while (!r.empty) { auto e = r.front; ... e ...;
> r.popFront(); }
>
> This makes it possible to build pipelines that are firehoses
> with no kinks or constrictions in them. It optimizes for the
> inner loop case, not boundary cases.
I get where you are coming from, but it's simply not manageable
in a generic fashion, if you want to be able to preserve all the
power and the diversity of the ranges we have.
The protocol you are suggesting would prevent us from doing a lot
of the lovely things that ranges empowers us with.
More information about the Digitalmars-d
mailing list