[Issue 13936] groupBy must be redone

via Digitalmars-d-bugs digitalmars-d-bugs at puremagic.com
Mon Jan 5 18:19:21 PST 2015


https://issues.dlang.org/show_bug.cgi?id=13936

--- Comment #5 from Andrei Alexandrescu <andrei at erdani.com> ---
(In reply to hsteoh from comment #4)
> @andrei:

Thanks for answering. OK, let's see.

> The current implementation of groupBy only traverses the range once if it's
> an input range (non-forward).

Yes - couldn't even do otherwise. Input non-forward ranges move along all
together (that's somewhat implied but let's assume it's always the case). So
groupBy relies on the fact that if r1 and r2 are copies of the same
input/non-forward range, calling popFront in one also pops the front in the
other.

That's efficient and all. The challenge here is to make sure the same goes for
forward ranges. The simplest way to achieve that is by using a pointer, i.e.
instead of using `Range`, use `Range*`. Then copying the pointer around will
refer to the same actual range. Calling popFront through one will also pop the
other.

But that produces garbage so a RefCounted!Range comes to mind. That would
probably be a nice solution. In my opinion what we have right now penalizes
forward ranges too much to be workable.

> The difference between equivalence and non-equivalence relations is actually
> important; it's not a "fussy" issue. A non-equivalence relation requires
> that you always evaluate the predicate on two adjacent elements, because you
> cannot assume transitivity: the fact that pred(r[0], r[1]) and pred(r[1],
> r[2]) holds, does not imply pred(r[0], r[2]) also holds, so you have to
> .save every previous position of the range instead of just once per subrange
> (determining the end of the subrange simply by evaluating the predicate with
> the head element of that subrange), when you know that pred is an
> equivalence relation.

We must go with efficiency for the common classic relational case. I think a
distinct place we do not want to be in is: "Well for cool relational stuff you
can use groupBy, it's very general and actually goes outside what relational
algebra can do. However, if you want speed don't use it; you gotta use manual
loops."

We must handle the typical relational algebra treatment, and handle it well
enough that handmade approaches are unnecessary. If someone has a weird
predicate and a weird requirement they can use adjacentFind or handmade loops
etc. We don't need to cater for them. Pushing two ranges along is inefficient
and therefore unacceptable.

> Your fast/slow range idea is flawed, because the outer range's .front must
> return an independent instance of the subrange. This is an artifact of the
> range API. There is no way for the outer range to keep track of how far the
> subrange has moved ahead without introducing some kind of reference
> semantics to it, which becomes messy if the caller invokes .front multiple
> times.

So there is no way, or there's a messy way? Big difference. I say there is a
way. 

> For example, if the user calls .front twice and .save's both
> subranges, then randomly iterates one or the other, what should happen?
> Should the outer range's 'fast' range track the first subrange, the second,
> or both, or neither (because they are .save'd copies)? Does advancing the
> first copy of the subrange also advance the second? (Which, btw, breaks the
> semantics of .save)?
> 
> The only way for your fast/slow range idea to work is to make subranges
> non-forward, so that there is only one copy of the 'fast' range that gets
> advanced no matter which copy of it you call popFront on. But this means
> users can no longer call .save on subranges, which sux if they handed you a
> forward range precisely for that purpose.

This can be handled, it's indeed not simple. There is an implementation in
https://github.com/D-Programming-Language/phobos/pull/1186 that is working but
a bit oddly and is probably more complicated than it needs to. 

Anyhow, the larger point is we want groupBy to work as close to native speed
for the common case of going once through all groups and with an equivalence
relation. For more complex uses, Just Fine(tm) at a small extra cost should be
okay. Is this reasonable?

--


More information about the Digitalmars-d-bugs mailing list