topN using a heap

Mon Jan 18 06:22:48 PST 2016

On Monday, 18 January 2016 at 12:00:10 UTC, Ivan Kazmenko wrote:
> On Sunday, 17 January 2016 at 22:20:30 UTC, Andrei Alexandrescu 
> wrote:
>> All - let me know how things can be further improved. Thx!
>>
> Here goes the test which shows quadratic behavior for the new 
> version:
> http://dpaste.dzfl.pl/e4b3bc26c3cf
> (dpaste kills the slow code before it completes the task)
>
> The inspiration is the paper "A Killer Adversary for Quicksort":
> http://www.cs.dartmouth.edu/~doug/mdmspe.pdf
> (I already mentioned it on the forums a while ago)
>
> Ivan Kazmenko.

Perhaps I should include a textual summary as well.

The code on DPaste starts by constructing an array of Elements of 
size MAX_N; in the code, MAX_N is 50_000.  After that, we run the 
victim function on our array.  Here, the victim is topN (array, 
MAX_N / 2); it could be sort (array) or something else.

An Element contains, or rather, pretends to contain, an int 
value.  Here is how Element is special: the result of comparison 
for two Elements is decided on-the-fly.  An Element can be either 
UNDECIDED or have a fixed value.  Initially, all elements are 
UNDECIDED.  When we compare two Elements and at least one of them 
has a fixed value, the comparison is resolved naturally, and 
UNDECIDED element is greater than any fixed element.  When we 
compare two UNDECIDED elements, the one which participated more 
in the comparisons so far gets a fixed value: greater than any 
other value fixed so far, but still less than UNDECIDED.  This 
way, the results of old comparisons are consistent with the new 
fixed value.

Now, what do we achieve by running the victim function?  Turns 
out that the algorithms using the idea of QuickSort or 
QuickSelect tend to make most comparisons against their current 
pivot value.  Our Element responds to that by fixing the pivot to 
one of the lowest available values.  After that, a partition 
using such pivot will have only few, O(1), elements before the 
pivot, and the rest after the pivot.  In total, this will lead to 
quadratic performance.

After running the victim function on our array of Elements (which 
- careful here - already takes quadratic time), we reorder them 
in their original order (to do that, each Element also stores its 
original index).

Now, we can re-run the algorithm on the array obtained so far.  
If the victim function is (strongly) pure, it will inevitably 
make the same comparisons in the same order.  The only difference 
is that their result will already be decided.

Alternatively, we can make an int array of the current values in 
our array of Elements (also in their original order).  Running 
the victim function on the int array must also make the same 
(quadratic number of) comparisons in the same order.

Ivan Kazmenko.