topN using a heap

Sun Jan 17 03:55:17 PST 2016

On Sunday, 17 January 2016 at 02:37:48 UTC, Timon Gehr wrote:
> On 01/17/2016 03:09 AM, Andrei Alexandrescu wrote:
>> On 1/16/16 8:00 PM, Timon Gehr wrote:
>>> The implementation falls back to topNHeap whenever k is 
>>> within the first
>>> or last ~n/8 elements and therefore is Ω(n log n) on average.
>>
>> Correct. The pedantic approach would be to only do that when k 
>> is up to
>> log(n). -- Andrei
>>
>
> Ivan's analysis suggests that even something significantly 
> larger, like n/log(n)² might work as an upper bound for k.
> I don't think that meeting the documented runtime bounds 
> amounts to pedantry. (Having linear average running time of 
> course does not even imply linear expected running time, as is 
> still written in the documentation.)

I'd suggest being able to switch between implementations.

Recall that, when sorting, we have SwapStrategy.stable which is, 
well, stable, but additionally guarantees n log n with a larger 
constant (MergeSort).

And we have SwapStrategy.unstable which uses Introsort, which, in 
turn, has smaller constant on sane inputs.

Here, Andrei's suggestion is to also have two implementations, 
let us call them TopNStrategy.quickSelect and TopNStrategy.heap.

The quickSelect one is O(n) on sane inputs but O(n^2) on an 
artificial worst case.

The heap implementation is also O(n) when k is close to 0 or n, 
but O(n log n) otherwise.  What's important is that it is also 
O(n log n) worst case.

The current proposal switches between strategies based on a 
heuristic (k < n/8 or k > 7n/8 if I understand correctly), which 
may not be the best strategy in all cases.

So, what can be done is to introduce TopNStrategy.auto which is 
the default and uses the (current or better) heuristic to switch 
between strategies, but also leave a way to explicitly select one 
of the two strategies, just like one can do now with sort!(..., 
SwapStrategy.whatWeExplicitlyWant).

All the code for both strategies is already there.  Each strategy 
has its benefits not covered by the other (quickSelect is faster 
for k~n and sane inputs, heap is faster for k close to boundary 
and special inputs).  So, provide the default but let the user 
choose.

Ivan Kazmenko.