topN using a heap
Xinok via Digitalmars-d
digitalmars-d at puremagic.com
Mon Jan 18 18:42:21 PST 2016
On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu
wrote:
> Of course not. I think this back-and-forth takes away from the
> gist of things. So let me summarize what has happened:
>
> 1. topN was reportedly slow. It was using a random pivot. I
> made it use getPivot (deterministic) instead of a random pivot
> in https://github.com/D-Programming-Language/phobos/pull/3921.
> getPivot is also what sort uses.
>
> 2. Now that both functions use getPivot, I set out to improve
> it in
> https://github.com/D-Programming-Language/phobos/pull/3922. The
> problem is I couldn't make getPivot impure; pure functions
> already call sort, so making it impure would have been a
> regression.
>
> 3. So I made getPivot use just a bit of randomness taken from
> the length of the range.
>
> 4. sort was attackable before all of these changes, and it
> still is
>
> 5. So now we have pure topN and sort (subject to the range
> offering pure primitives) but they are both attackable.
>
> 6. PRNGs have no downside other than making functions impure.
>
> The way I see it we have these solutions:
>
> (a) make topN still use a random pivot. That way there's no
> regression. Then proceed and make sort() avoid quadratic cases
> in other ways, e.g. switch to heapsort if performance degrades.
>
> (b) Improve sort() first, then apply a similar strategy to
> improving topN. Do not use RNGs at all.
>
> (c) Seek randomness in other places, e.g. addresses of
> elements, data hashes, etc. Come up with a solution that may
> still be attacked in narrow cases, but make that unlikely
> enough not to be a real concern.
>
>
> Andrei
To be clear, sort is NOT attackable. Unstable sorting uses
introsort, which begins with quick sort but falls back to heap
sort after too many recursions in order to maintain O(n log n)
running time. Stable sorting uses Timsort, a variant of merge
sort that likewise runs in O(n log n) time. In either case, the
worst-case running time is guaranteed to be O(n log n).
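For illustration, here is a minimal sketch in D of that introsort
scheme. It is not the actual Phobos source: the middle-element
pivot, the Lomuto partition, and the depth budget of 2*log2(n)
are simplifications.

import std.algorithm.mutation : swapAt;
import core.bitop : bsr;

// Restore the max-heap property for the subtree rooted at i,
// considering only the first n elements of r.
void siftDown(T)(T[] r, size_t i, size_t n)
{
    for (;;)
    {
        size_t m = i;
        immutable l = 2 * i + 1, h = 2 * i + 2;
        if (l < n && r[m] < r[l]) m = l;
        if (h < n && r[m] < r[h]) m = h;
        if (m == i) return;
        swapAt(r, i, m);
        i = m;
    }
}

// Heap sort: the O(n log n) fallback used once quick sort has
// recursed too deeply.
void heapSort(T)(T[] r)
{
    if (r.length < 2) return;
    foreach_reverse (i; 0 .. r.length / 2)
        siftDown(r, i, r.length);
    foreach_reverse (i; 1 .. r.length)
    {
        swapAt(r, 0, i);   // move the current maximum into place
        siftDown(r, 0, i); // re-heapify the remaining prefix
    }
}

void introSort(T)(T[] r, size_t depth)
{
    while (r.length > 1)
    {
        if (depth == 0) { heapSort(r); return; } // budget spent
        --depth;
        // Middle-element pivot with a Lomuto partition.
        swapAt(r, r.length / 2, r.length - 1);
        size_t store = 0;
        foreach (i; 0 .. r.length - 1)
            if (r[i] < r[$ - 1]) swapAt(r, i, store++);
        swapAt(r, store, r.length - 1);
        introSort(r[0 .. store], depth); // recurse on the left
        r = r[store + 1 .. $];           // loop on the right
    }
}

void introSort(T)(T[] r)
{
    introSort(r, r.length < 2 ? 0 : 2 * bsr(r.length));
}

unittest
{
    auto a = [3, 1, 4, 1, 5, 9, 2, 6];
    introSort(a);
    assert(a == [1, 1, 2, 3, 4, 5, 6, 9]);
}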
On the other hand, quick sort can be improved upon by choosing
better pivots, taking care not to add too much overhead. A couple
of years ago, I tried choosing the pivot from a median of five,
but the result was as much as 10% slower. One idea is to begin by
choosing better pivots, but once the sublists fall below a
certain length, simply choose the pivot from a median of three to
avoid the extra overhead.
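A sketch of that idea in D might look like the following. The
cutoff of 64 elements and the evenly spaced sampling positions
are assumptions that would need benchmarking; choosePivot and
medianOfThree are hypothetical helpers, not Phobos functions.

import std.algorithm.mutation : swapAt;
import std.algorithm.sorting : sort;

// Order r[a] <= r[b] <= r[c] with at most three swaps; index b
// then holds the median of the three samples.
size_t medianOfThree(T)(T[] r, size_t a, size_t b, size_t c)
{
    if (r[b] < r[a]) swapAt(r, a, b);
    if (r[c] < r[b]) swapAt(r, b, c);
    if (r[b] < r[a]) swapAt(r, a, b);
    return b;
}

// Median of five on long sublists, median of three below the
// cutoff to avoid the extra overhead.
size_t choosePivot(T)(T[] r)
{
    assert(r.length >= 3);
    enum cutoff = 64; // assumed threshold, untuned
    immutable n = r.length;
    if (n < cutoff)
        return medianOfThree(r, 0, n / 2, n - 1);
    // Gather five evenly spaced samples at the front, sort them,
    // and take the middle one as the pivot.
    immutable step = n / 5;
    foreach (i; 0 .. 5)
        swapAt(r, i, i * step);
    sort(r[0 .. 5]);
    return 2;
}

unittest
{
    auto a = [9, 1, 8, 2, 7, 3, 6, 4, 5];
    assert(choosePivot(a) == a.length / 2); // median-of-three path
}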
As for topN, there are two approaches I'm aware of that are
deterministic without resorting to impurities or RNGs. The first
is introselect, which is similar to introsort in that it has a
fallback algorithm. The other approach is to choose the pivot
from a median of medians. My idea is a variant of the latter
approach: begin by choosing the pivot from a median of five. If
bad pivots keep getting chosen, take a larger sample of elements
to choose the pivot from, and keep growing the sample as
necessary.
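Here is a rough sketch of that adaptive scheme in D. The "bad
pivot" test (more than 3/4 of the elements land on one side) and
the doubling schedule are assumptions, and adaptiveTopN,
partitionAround, and samplePivot are hypothetical names, not
Phobos functions.

import std.algorithm.comparison : min;
import std.algorithm.mutation : swapAt;
import std.algorithm.sorting : sort;

// Lomuto partition around r[pivotIndex]; returns the pivot's
// final position.
size_t partitionAround(T)(T[] r, size_t pivotIndex)
{
    swapAt(r, pivotIndex, r.length - 1);
    size_t store = 0;
    foreach (i; 0 .. r.length - 1)
        if (r[i] < r[$ - 1]) swapAt(r, i, store++);
    swapAt(r, store, r.length - 1);
    return store;
}

// Median of k evenly spaced samples, gathered at the front of r.
size_t samplePivot(T)(T[] r, size_t k)
{
    immutable step = r.length / k;
    foreach (i; 0 .. k)
        swapAt(r, i, i * step);
    sort(r[0 .. k]);
    return k / 2;
}

// Quickselect starting from a median of five; whenever a
// partition is badly unbalanced, double the sample size used for
// the next pivot.
T adaptiveTopN(T)(T[] r, size_t n)
{
    size_t sample = 5;
    while (r.length > 1)
    {
        immutable k = min(sample, r.length);
        immutable p = partitionAround(r, samplePivot(r, k));
        if (p == n) return r[p];
        immutable larger = p > r.length - p ? p : r.length - p;
        if (larger * 4 > r.length * 3)
            sample = min(sample * 2, r.length); // bad pivot: grow
        if (n < p)
            r = r[0 .. p];
        else
        {
            n -= p + 1;
            r = r[p + 1 .. $];
        }
    }
    return r[n];
}

unittest
{
    auto a = [5, 2, 9, 1, 7, 3, 8, 0, 6, 4];
    assert(adaptiveTopN(a, 4) == 4); // 5th smallest element
}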
Speaking of RNGs, they're technically pure as long as you always
use the same seed. So what if we generated a random seed at the
start of the process, but then reused that same seed over and
over in pure functions for the duration of the process?
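In D that could look something like the sketch below, assuming a
module-level immutable seed is acceptable. The seed is drawn once
in a shared static constructor; since pure functions are allowed
to read immutable globals, pure code can then fold it into a
pivot choice. (perturbedPivot is a made-up example, not a Phobos
function.)

import std.random : unpredictableSeed;

// Drawn once per process, frozen thereafter.
immutable uint processSeed;

shared static this()
{
    processSeed = unpredictableSeed; // impure, but runs only once
}

// Pure functions may read immutable globals, so the frozen seed
// costs nothing in purity while still varying between runs of
// the process.
size_t perturbedPivot(size_t length) pure
{
    assert(length > 0);
    return (length ^ processSeed) % length;
}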