[Issue 13936] groupBy must be redone

via Digitalmars-d-bugs digitalmars-d-bugs at puremagic.com
Thu Jan 8 00:07:13 PST 2015


https://issues.dlang.org/show_bug.cgi?id=13936

--- Comment #8 from Andrei Alexandrescu <andrei at erdani.com> ---
Great going, thx. Let's beat this into shape.

(In reply to hsteoh from comment #7)
> I don't understand your beef against isEquivRelation. It's there precisely
> to give users the choice of having maximal expressive power at a slight
> performance cost, or being restricted to only equivalence relations but have
> better performance.

I'd say offering non-equivalence relations would be desirable only if (a) the
default is to assume the predicate is an equivalence relation, and (b)
implementing it doesn't take focus away from the matter at hand, which is an
efficient groupBy with equivalence relations against input and forward ranges.

> As for only iterating things once, a thought occurred to me today that
> instead of creating a new subrange every time the outer .front is called, we
> can simply keep *both* inner and outer ranges in the same struct, and .front
> simply returns a reference to the inner range. Multiple calls to .front
> returns (a reference to) the same inner range, so popping any of them will
> pop all of them at the same time. The inner range is then only valid until
> the next call to popFront() -- i.e., groupBy will return a transient range.
> This will give you maximum performance.

Yah, but it's not sufficient as you note below.

> However, it forces the inner range to be non-forward -- there is no way to
> .save it without making a copy of both the outer and inner ranges, and once
> you .save it, it detaches from the outer range and no longer influences it
> when popFront is called. There is no way to make .save work without
> completely breaking the idea of single traversal, because if I use .save to
> make n copies of an inner range, then which copy's popFront will influence
> the outer range? And what happens if I call popFront on the outer range and
> then go back to a .save'd copy of the inner range afterwards? There are
> endless nasty corner cases that make this extremely messy.
>
> So there you have it: a fast solution. But it comes at the cost of messy
> semantics (transient outer range, inner ranges have reference semantics and
> cannot be .save'd). The transience of the outer range is quite a major
> concern IMO, because many algorithms in Phobos do not deal with them
> correctly. Expect all the same problems with byLine to turn up all over
> again.

I agree that's not a good solution. We shouldn't code it.

> Honestly, I'd rather make this single-pass implementation a *different*
> function (or use a different compile-time argument) that is opt-in rather
> than opt-out. D's motto after all is "correct by default, fast if you ask
> for it". Naïve use cases of groupBy should return the "safest" behaviour --
> i.e., the current behaviour where nothing is transient and .save has the
> usual semantics. If people find that the performance is too slow, recommend
> groupByFast (or equivalent), with the caveat that they will have to deal
> with the more unconventional semantics in their code.

"It can't be done satisfactorily well" is in a way a given - it's the baseline
we're starting from. It's also not terribly interesting because it precludes
continuing to look into the matter. The trick is to actually find a way to do
it, and here's where I'm relying on your wits.

I think it can be done in a satisfactory manner. There needs to be
communication between individual groups and the mothership (range of ranges).
Did you take a look at that pull request of mine? In there, each group has a
serial number, and the mothership also has a serial number. That way it's easy
to figure which group the mothership is currently in.

Once a group reaches its end, it "calls home" (the mothership) and checks
whether the serial number of the mothership is the same as the serial number of
the group. If so, great - it offers the mothership the O(1) answer for the
mothership's next popFront. If not, it means the mothership has already moved
forward so there's nothing to do.

I feel I'm not explaining this very well. Basically the proper iteration can be
achieved by establishing a simple protocol between the range of ranges and each
individual group.

Does this make sense?

--


More information about the Digitalmars-d-bugs mailing list