[Issue 13936] groupBy must be redone

Mon Jan 12 13:23:17 PST 2015

https://issues.dlang.org/show_bug.cgi?id=13936

--- Comment #29 from Andrei Alexandrescu <andrei at erdani.com> ---
(In reply to hsteoh from comment #26)
> The first thing I note, is that this still does not address the situation
> where the group is incompletely iterated, then the user decides to pop the
> outer range. In that case you duplicate almost all of the effort of popping
> the entire group again.

It's trivial - just add the adjustment in the Group destructor as well.

> The choice of implementation also imposes an artificial limitation on group
> length to uint.max, and would insert an artificial group boundary at that
> point even if predicate() still holds true. This is unlikely to ever happen,
> but still, it would be nice to have some insurance for it in documentation
> and in throwing an exception should this case get triggered (but
> unfortunately that breaks nothrow, which could be a show-stopper). Or use
> size_t, which defers the problem to the more unlikely distant future.

Come on, that's nitpicking. Make that size_t or ulong. Implementation may use
an extra bool, too. It's not a golden standard, it's a proof of concept! 

> A similar thing happens with wrapped-around group serial numbers, but that's
> a far less likely case to cause problems 'cos the user would have to keep a
> reference to a group around for uint.max groups, which realistically won't
> happen unless he was deliberately trying to shoot his own foot.
> 
> Now on to the implementation:
> 
> - What on earth is going on with GroupBy.groupStart and
> GroupBy.groupCurrent?! Shouldn't there just be a single reference to the
> underlying range, which gets advanced to groupNext once the current Group is
> finished? Currently this code looks really suspect because .groupStart and
> .groupCurrent seem to be redundant with each other, yet .empty checks
> .groupCurrent whereas .front copies .groupStart.

groupStart is there only for its front. Calling front doesn't use groupStart.

> - GroupBy.front passes the current range by value which Group stores by
> value. This will cause broken behaviour if the range has reference
> semantics: Group.popFront will cause subsequent calls to GroupBy.front to
> return a truncated group. Previously-saved Group's will also break once the
> mothership advances to the next group, because they use groupStart to
> evaluate the predicate, which would have changed if the underlying range has
> reference semantics. Basically, .save should be used everywhere except where
> you can justify not using it, otherwise you will almost certainly get broken
> boundary cases.

The implementation is intended for forward ranges, which are the hard case
here. If there are other issues, could you give an example? 

> - OTOH, if the range has very short groups, then most of the performance
> will be bottlenecked on all those .save's every time .front is accessed. We
> should probably profile this thing, with at least one test case with very
> long groups, and one test case with very short (1-2 element) groups. And of
> course, test it to death.

That would be great indeed.

> - Group doesn't implement .save? Shouldn't that be possible with this design?

It should.

> - Group.this assumes groupStart is non-empty. Should assert in case future
> changes break GroupBy.front.

That seems like a safe assumption to me. Of course assert is appropriate.

> There are probably more issues, but I haven't looked closely enough yet. :-P

I'm surprised. It's a very short codebase - 128 lines all told and uses no
trick and no subterfuge. All straight code.

> I think the design seems OK (barring the issue with incompletely-iterated
> Groups, but I'm not sure if fixing that wouldn't introduce too much
> additional complexity), but the implementation is atrocious! :-P

I find that hard to agree with, especially since you had predicted dire
difficulties and contortions several times in your long posts dedicated to
explaining how it can't be done. Frankly after 128 lines of code destroy
assertions that took several walls of text, I was shooting for "Oh okay, I see
what you mean."

> Plus, it
> totally doesn't work with input ranges, as you noted.

Input non-forward ranges are the easy case.

> In fact, input ranges
> totally don't need refcounting or any of that serial number fanciness --
> those would just be useless overhead. I'd even propose that for input ranges
> we should just use my current implementation. :-D  I'm almost ready to bet
> that performance is better in that case. They could just be side-by-side
> overloads.

That was my intent as well.

--