topN using a heap

Wed Sep 21 01:16:34 PDT 2016

On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu 
wrote:
>
> So let me summarize what has happened:
>
> 1. topN was reportedly slow. It was using a random pivot. I 
> made it use getPivot (deterministic) instead of a random pivot 
> in https://github.com/D-Programming-Language/phobos/pull/3921. 
> getPivot is also what sort uses.
>
> [snip]
>
Not completely clear from this thread what the conclusion was wrt 
getting known topN performance issues addressed. From pull 
requests it appears identified fixes are in current release 
versions of the DMD/LDC. However, I hit significant issues on one 
of the first large data sets I tried. Not an artificial data, but 
one with very skewed distributions of values (a google ngram 
file).

Details here:  https://issues.dlang.org/show_bug.cgi?id=16517. 
Includes test program, url for the ngram file.

A brief summary - Data file is a TSV file with 3 numeric fields, 
a bit over 10 million values each with different distribution 
properties. Used both topN and sort get the median value. Since 
this was median, it topN for the mid-point value, not at one end 
or the other. (This is a specific callout for some of the issues 
identified.)

Timing comparison of sort and topN, times in milliseconds:

           sort      topN
Field 2:   289      1756
Field 3:   285    148793
Field 4:   273    668906

The above times are for LDC 1.1.0-beta2 (DMD 2.071.1). Similar 
behavior is seen for DMD 2.071.2. This makes topN pretty much 
unusable.