topN using a heap

Andrei Alexandrescu via Digitalmars-d digitalmars-d at puremagic.com
Tue Jan 19 05:10:42 PST 2016


On 01/18/2016 09:44 PM, Ivan Kazmenko wrote:
> On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu wrote:
>> 4. sort was and is attackable before all of these changes
>>
>> (b) Improve sort() first, then apply a similar strategy to improving
>> topN. Do not use RNGs at all.
>
> Since point 4 is in fact already fixed a good while ago, my suggestion
> would be (b): to do the same for topN.
>
> Introselect (https://en.wikipedia.org/wiki/Introselect), already
> mentioned in this thread, uses median-of-medians to achieve worst case
> O(n) performance if we recurse too deep.  There is already an issue
> suggesting to implement it:
> https://issues.dlang.org/show_bug.cgi?id=12141 (std.algorithm: implement
> deterministic topN).
>
> In fact, the O(n log n) heap approach as it is implemented now could be
> faster than O(n) median-of-medians approach on reasonable inputs.  So,
> someone will have to implement median-of-medians and, well, measure.
>
> At the very least, googling for "median of medians in practice" and such
> yields the tag wiki from StackOverflow.com: "The constant factor in the
> O(n) is large, and the algorithm is not commonly used in practice.".

Yah, I also think heap/double-heap topN is better; median-of-medians 
(MoM) is theoretically nice but practically less so. Though there are 
two things that could help MoM in D: (a) median of five (core part of 
MoM) is relatively cheap to compute with six comparisons; (b) we can use 
stride() to compute median without needing to copy a fifth of the input 
or to complicate the algorithm - just recurse on the stride! All in all 
this might be a good algo to implement and test.

So anyhow we need the "intro" part. I'll get to that soon.


Andrei



More information about the Digitalmars-d mailing list