[Issue 13936] groupBy must be redone

Mon Jan 12 11:48:08 PST 2015

https://issues.dlang.org/show_bug.cgi?id=13936

--- Comment #26 from hsteoh at quickfur.ath.cx ---
For some reason I couldn't access the dpaste yesterday. But anyway, it works
now.

The first thing I note, is that this still does not address the situation where
the group is incompletely iterated, then the user decides to pop the outer
range. In that case you duplicate almost all of the effort of popping the
entire group again.

The choice of implementation also imposes an artificial limitation on group
length to uint.max, and would insert an artificial group boundary at that point
even if predicate() still holds true. This is unlikely to ever happen, but
still, it would be nice to have some insurance for it in documentation and in
throwing an exception should this case get triggered (but unfortunately that
breaks nothrow, which could be a show-stopper). Or use size_t, which defers the
problem to the more unlikely distant future.

A similar thing happens with wrapped-around group serial numbers, but that's a
far less likely case to cause problems 'cos the user would have to keep a
reference to a group around for uint.max groups, which realistically won't
happen unless he was deliberately trying to shoot his own foot.

Now on to the implementation:

- What on earth is going on with GroupBy.groupStart and GroupBy.groupCurrent?!
Shouldn't there just be a single reference to the underlying range, which gets
advanced to groupNext once the current Group is finished? Currently this code
looks really suspect because .groupStart and .groupCurrent seem to be redundant
with each other, yet .empty checks .groupCurrent whereas .front copies
.groupStart.

- GroupBy.front passes the current range by value which Group stores by value.
This will cause broken behaviour if the range has reference semantics:
Group.popFront will cause subsequent calls to GroupBy.front to return a
truncated group. Previously-saved Group's will also break once the mothership
advances to the next group, because they use groupStart to evaluate the
predicate, which would have changed if the underlying range has reference
semantics. Basically, .save should be used everywhere except where you can
justify not using it, otherwise you will almost certainly get broken boundary
cases.

- OTOH, if the range has very short groups, then most of the performance will
be bottlenecked on all those .save's every time .front is accessed. We should
probably profile this thing, with at least one test case with very long groups,
and one test case with very short (1-2 element) groups. And of course, test it
to death.

- Group doesn't implement .save? Shouldn't that be possible with this design?

- Group.this assumes groupStart is non-empty. Should assert in case future
changes break GroupBy.front.

There are probably more issues, but I haven't looked closely enough yet. :-P  I
think the design seems OK (barring the issue with incompletely-iterated Groups,
but I'm not sure if fixing that wouldn't introduce too much additional
complexity), but the implementation is atrocious! :-P  Plus, it totally doesn't
work with input ranges, as you noted. In fact, input ranges totally don't need
refcounting or any of that serial number fanciness -- those would just be
useless overhead. I'd even propose that for input ranges we should just use my
current implementation. :-D  I'm almost ready to bet that performance is better
in that case. They could just be side-by-side overloads.

--