topN using a heap

Andrei Alexandrescu via Digitalmars-d digitalmars-d at puremagic.com
Sat Sep 24 11:22:51 PDT 2016


On 09/23/2016 04:45 PM, Jon Degenhardt wrote:
> That's a very nice result. Both the absolute numbers and that all three
> sets achieve similar performance, as they different distribution
> characteristics.

Got curious so I tried to patch my ldc installation (0.17.1 (DMD 
v2.068.2, LLVM 3.8.0)), but sadly that didn't work out because sorting.d 
uses the new syntax in various places.

Probably the best baseline is the equivalent C++ code using the mature 
implementation of nth_element, see 
http://paste.ofcode.org/ieZPdchjzTXbDcG3LvsYBP (also pasted at the end 
of this message). Here's what I got: 
http://paste.ofcode.org/6feyBLRiJtieHdZ3bmGaUW, see also text below.

The topN code produced with dmd is faster in virtually all cases (has 
parity for -f 4, which I suspect is a degenerate case with most elements 
equal, which exercises only a small part of the algorithm). For C++ I 
used -O3 -DNDEBUG, any other flags I should use?

As an aside, I enjoy poking fun at the stunning inefficiency of 
iostreams in my public talks; they delivered again :o).


Andrei

=========================== shell
$ g++ -O3 -DNDEBUG -omedian_by_sort_cpp issue16517.cpp
$ g++ -O3 -DNDEBUG -DTOPN -omedian_by_topn_cpp issue16517.cpp
$ dmd -O -inline -release -ofmedian_by_sort -boundscheck=off 
issue16517.d
$ dmd -O -inline -release -version=topn -ofmedian_by_topn 
-boundscheck=off issue16517.d
$ cut -f 2  /tmp/googlebooks-eng-all-1gram-20120701-0 | 
./median_by_sort_cpp
median (via sort): 1972; lines: 10512769; total time (ms): 3419.11; sort 
time (ms): 314.086
$ cut -f 2  /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_sort
median (via sort): 1972; lines: 10512769; total time (ms): 1462; sort 
time (ms): 399
$ cut -f 2  /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn_cpp
median (via nth_element): 1972; lines: 10512769; total time (ms): 
3255.41; nth_element time (ms): 40.676
$ cut -f 2  /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn
median (via topN): 1972; lines: 10512769; total time (ms): 1211; topN 
time (ms): 32
$ cut -f 3  /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_sort_cpp
median (via sort): 3; lines: 10512769; total time (ms): 2949.11; sort 
time (ms): 297.814
$ cut -f 3  /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_sort
median (via sort): 3; lines: 10512769; total time (ms): 1351; sort time 
(ms): 382
$ cut -f 3  /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn_cpp
median (via nth_element): 3; lines: 10512769; total time (ms): 2316.75; 
nth_element time (ms): 65.041
$ cut -f 3  /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn
median (via topN): 3; lines: 10512769; total time (ms): 973; topN time 
(ms): 59
$ cut -f 4  /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_sort_cpp
median (via sort): 2; lines: 10512769; total time (ms): 2614.47; sort 
time (ms): 351.285
$ cut -f 4  /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_sort
median (via sort): 2; lines: 10512769; total time (ms): 1269; sort time 
(ms): 361
$ cut -f 4  /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn_cpp
median (via nth_element): 2; lines: 10512769; total time (ms): 2443.17; 
nth_element time (ms): 76.211
$ cut -f 4  /tmp/googlebooks-eng-all-1gram-20120701-0 | ./median_by_topn
median (via topN): 2; lines: 10512769; total time (ms): 998; topN time 
(ms): 77
===========================

=========================== issue16517.cpp
#include <iostream>
#include <cstdio>
#include <ctime>
#include <vector>
#include <algorithm>

using namespace std;

int main()
{
     clock_t swTotal, swSort;
     swTotal = clock();
     vector<double> values;
     double num;
     while (cin >> num) {
         values.push_back(num);
     }

     size_t medianIndex = values.size() / 2;
     swSort = clock();
     #ifdef TOPN
         const char* method = "nth_element";
         nth_element(values.begin(), values.begin() + medianIndex, 
values.end());
     #else
         const char* method = "sort";
         sort(values.begin(), values.end());
     #endif
     clock_t done = clock();

     printf("median (via %s): %g; lines: %zu; total time (ms): %g; %s 
time (ms): %g\n",
              method, values[medianIndex], values.size(),
              (done - swTotal) * 1000. / CLOCKS_PER_SEC,
              method, (done - swSort) * 1000. / CLOCKS_PER_SEC);
}
===========================

=========================== issue16517.d
import std.stdio;
import std.array : appender;
import std.algorithm : sort, topN;
import std.conv;
import std.range;
import std.datetime: StopWatch;

void main()
{
     StopWatch swTotal, swSort;
     swTotal.start;
     Appender!(double[]) values;
     foreach (line; stdin.byLine)
         values.put(line.to!double);

     size_t medianIndex = values.data.length/2;
     swSort.start;
     version(topn)
     {
         topN(values.data, medianIndex);
         auto method = "topN";
     }
     else {
         sort(values.data);
         auto method = "sort";
     }
     swTotal.stop;
     swSort.stop;

     writefln("median (via %s): %g; lines: %d; total time (ms): %d; %s 
time (ms): %d",
              method, values.data[medianIndex], values.data.length, 
swTotal.peek.msecs,
              method, swSort.peek.msecs);
}
===========================



More information about the Digitalmars-d mailing list