topN using a heap
Xinok via Digitalmars-d
digitalmars-d at puremagic.com
Mon Jan 18 18:42:21 PST 2016
On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu
wrote:
> Of course not. I think this back-and-forth takes away from the
> gist of things. So let me summarize what has happened:
>
> 1. topN was reportedly slow. It was using a random pivot. I
> made it use getPivot (deterministic) instead of a random pivot
> in https://github.com/D-Programming-Language/phobos/pull/3921.
> getPivot is also what sort uses.
>
> 2. Now that both functions use getPivot, I set out to improve
> it in
> https://github.com/D-Programming-Language/phobos/pull/3922. The
> problem is I couldn't make getPivot impure; pure functions
> already call sort, so making it impure would have been a
> regression.
>
> 3. So I made getPivot use just a bit of randomness taken from
> the length of the range.
>
> 4. sort was attackable before all of these changes, and it
> still is
>
> 5. So now we have pure topN and sort (subject to the range
> offering pure primitives) but they are both attackable.
>
> 6. PRNGs have no downside other than making functions impure.
>
> The way I see it we have these solutions:
>
> (a) make topN still use a random pivot. That way there's no
> regression. Then proceed and make sort() avoid quadratic cases
> in other ways, e.g. switch to heapsort if performance degrades.
>
> (b) Improve sort() first, then apply a similar strategy to
> improving topN. Do not use RNGs at all.
>
> (c) Seek randomness in other places, e.g. addresses of
> elements, data hashes, etc. Come up with a solution that may
> still be attacked in narrow cases, but make that unlikely
> enough not to be a real concern.
>
>
> Andrei
To be clear, sort is NOT attackable. Unstable sorting uses
introsort, which begins with quick sort but falls back to heap
sort after too many recursions in order to maintain O(n log n)
running time. Stable sorting uses Timsort, a variant of merge
sort that likewise runs in O(n log n) time. In either case, the
worst-case running time is guaranteed to be O(n log n).
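For illustration, here is a minimal sketch in D of that introsort
scheme. It is not the actual Phobos source: the middle-element
pivot, the Lomuto partition, and the depth budget of 2*log2(n)
are simplifications.

import std.algorithm.mutation : swapAt;
import core.bitop : bsr;

// Restore the max-heap property for the subtree rooted at i,
// considering only the first n elements of r.
void siftDown(T)(T[] r, size_t i, size_t n)
{
    for (;;)
    {
        size_t m = i;
        immutable l = 2 * i + 1, h = 2 * i + 2;
        if (l < n && r[m] < r[l]) m = l;
        if (h < n && r[m] < r[h]) m = h;
        if (m == i) return;
        swapAt(r, i, m);
        i = m;
    }
}

// Heap sort: the O(n log n) fallback used once quick sort has
// recursed too deeply.
void heapSort(T)(T[] r)
{
    if (r.length < 2) return;
    foreach_reverse (i; 0 .. r.length / 2)
        siftDown(r, i, r.length);
    foreach_reverse (i; 1 .. r.length)
    {
        swapAt(r, 0, i);   // move the current maximum into place
        siftDown(r, 0, i); // re-heapify the remaining prefix
    }
}

void introSort(T)(T[] r, size_t depth)
{
    while (r.length > 1)
    {
        if (depth == 0) { heapSort(r); return; } // budget spent
        --depth;
        // Middle-element pivot with a Lomuto partition.
        swapAt(r, r.length / 2, r.length - 1);
        size_t store = 0;
        foreach (i; 0 .. r.length - 1)
            if (r[i] < r[$ - 1]) swapAt(r, i, store++);
        swapAt(r, store, r.length - 1);
        introSort(r[0 .. store], depth); // recurse on the left
        r = r[store + 1 .. $];           // loop on the right
    }
}

void introSort(T)(T[] r)
{
    introSort(r, r.length < 2 ? 0 : 2 * bsr(r.length));
}

unittest
{
    auto a = [3, 1, 4, 1, 5, 9, 2, 6];
    introSort(a);
    assert(a == [1, 1, 2, 3, 4, 5, 6, 9]);
}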
On the other hand, quick sort can be improved upon by choosing
better pivots, taking care not to add too much overhead. A couple
of years ago, I tried choosing the pivot from a median of five,
but the result was as much as 10% slower. One idea is to begin by
choosing better pivots, but once the sublists fall below a
certain length, simply choose the pivot from a median of three to
avoid the extra overhead.
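A sketch of that idea in D might look like the following. The
cutoff of 64 elements and the evenly spaced sampling positions
are assumptions that would need benchmarking; choosePivot and
medianOfThree are hypothetical helpers, not Phobos functions.

import std.algorithm.mutation : swapAt;
import std.algorithm.sorting : sort;

// Order r[a] <= r[b] <= r[c] with at most three swaps; index b
// then holds the median of the three samples.
size_t medianOfThree(T)(T[] r, size_t a, size_t b, size_t c)
{
    if (r[b] < r[a]) swapAt(r, a, b);
    if (r[c] < r[b]) swapAt(r, b, c);
    if (r[b] < r[a]) swapAt(r, a, b);
    return b;
}

// Median of five on long sublists, median of three below the
// cutoff to avoid the extra overhead.
size_t choosePivot(T)(T[] r)
{
    assert(r.length >= 3);
    enum cutoff = 64; // assumed threshold, untuned
    immutable n = r.length;
    if (n < cutoff)
        return medianOfThree(r, 0, n / 2, n - 1);
    // Gather five evenly spaced samples at the front, sort them,
    // and take the middle one as the pivot.
    immutable step = n / 5;
    foreach (i; 0 .. 5)
        swapAt(r, i, i * step);
    sort(r[0 .. 5]);
    return 2;
}

unittest
{
    auto a = [9, 1, 8, 2, 7, 3, 6, 4, 5];
    assert(choosePivot(a) == a.length / 2); // median-of-three path
}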
As for topN, there are two approaches I'm aware of that are
deterministic without resorting to impurities or RNGs. The first
is introselect, which is similar to introsort in that it has a
fallback algorithm. The other approach is to choose the pivot
from a median of medians. My idea is a variant of the latter
approach: begin by choosing the pivot from a median of five. If
bad pivots keep getting chosen, take a larger sample of elements
to choose the pivot from, and keep growing the sample as
necessary.
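Here is a rough sketch of that adaptive scheme in D. The "bad
pivot" test (more than 3/4 of the elements land on one side) and
the doubling schedule are assumptions, and adaptiveTopN,
partitionAround, and samplePivot are hypothetical names, not
Phobos functions.

import std.algorithm.comparison : min;
import std.algorithm.mutation : swapAt;
import std.algorithm.sorting : sort;

// Lomuto partition around r[pivotIndex]; returns the pivot's
// final position.
size_t partitionAround(T)(T[] r, size_t pivotIndex)
{
    swapAt(r, pivotIndex, r.length - 1);
    size_t store = 0;
    foreach (i; 0 .. r.length - 1)
        if (r[i] < r[$ - 1]) swapAt(r, i, store++);
    swapAt(r, store, r.length - 1);
    return store;
}

// Median of k evenly spaced samples, gathered at the front of r.
size_t samplePivot(T)(T[] r, size_t k)
{
    immutable step = r.length / k;
    foreach (i; 0 .. k)
        swapAt(r, i, i * step);
    sort(r[0 .. k]);
    return k / 2;
}

// Quickselect starting from a median of five; whenever a
// partition is badly unbalanced, double the sample size used for
// the next pivot.
T adaptiveTopN(T)(T[] r, size_t n)
{
    size_t sample = 5;
    while (r.length > 1)
    {
        immutable k = min(sample, r.length);
        immutable p = partitionAround(r, samplePivot(r, k));
        if (p == n) return r[p];
        immutable larger = p > r.length - p ? p : r.length - p;
        if (larger * 4 > r.length * 3)
            sample = min(sample * 2, r.length); // bad pivot: grow
        if (n < p)
            r = r[0 .. p];
        else
        {
            n -= p + 1;
            r = r[p + 1 .. $];
        }
    }
    return r[n];
}

unittest
{
    auto a = [5, 2, 9, 1, 7, 3, 8, 0, 6, 4];
    assert(adaptiveTopN(a, 4) == 4); // 5th smallest element
}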
Speaking of RNGs, they're technically pure as long as you always
use the same seed. So what if we generated a random seed at the
start of the process, but then reused that same seed over and
over in pure functions for the duration of the process?
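In D that could look something like the sketch below, assuming a
module-level immutable seed is acceptable. The seed is drawn once
in a shared static constructor; since pure functions are allowed
to read immutable globals, pure code can then fold it into a
pivot choice. (perturbedPivot is a made-up example, not a Phobos
function.)

import std.random : unpredictableSeed;

// Drawn once per process, frozen thereafter.
immutable uint processSeed;

shared static this()
{
    processSeed = unpredictableSeed; // impure, but runs only once
}

// Pure functions may read immutable globals, so the frozen seed
// costs nothing in purity while still varying between runs of
// the process.
size_t perturbedPivot(size_t length) pure
{
    assert(length > 0);
    return (length ^ processSeed) % length;
}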