Worst-case performance of quickSort / getPivot

Sat Nov 16 21:21:23 PST 2013

On Sunday, 17 November 2013 at 04:14:22 UTC, Andrei Alexandrescu 
wrote:
> Probably a good step forward would be to hook a sort 
> benchmarking corpus to our unittests. What are the consecrated 
> corpora?

I'll kick off the suggestions with this:

File to provide "real" data:
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/NC_000913.fna

 From http://www.genome.jp/kegg-bin/show_organism?org=eco -> 
RefSeq -> NC_000913.fna

(despite this being real data, typically you don't actually sort 
genomes like this ... it's simply "more interesting" than a 
pseudo-randomized dataset)

Code:

---
import std.range, std.algorithm, std.stdio, std.array, std.conv, 
std.datetime;

enum strategy = SwapStrategy.stable;

void main() {
     auto arr = File("NC_000913.fna").byLine
             .drop(1) // Trim the first line because it's just 
metadata
             /*.take(50)*/
             .join
             .chunks(50)
             .map!text
             .array;
     StopWatch sw = StopWatch(AutoStart.yes);
     arr.sort!("a < b", strategy)();
     sw.stop();
     writeln(text(arr.length, " elements sorted in ", 
sw.peek.msecs, " msecs"));
}
---

On my system, this shows unstable sort to be ~2x faster than 
stable sort (which is to be expected of data with no discernible 
patterns).

The "key" thing this is testing is the effectiveness of the sort 
on data that often takes longer to compare (with such a small 
alphabet, many strings have a common prefix).

That said, it might also be reproduced "well enough" using a 
random generator to create similar strings to sort, but the basic 
idea is there. I just like using real genomes for performance 
testing things :)