Worst-case performance of quickSort / getPivot

Chris Cain clcain at uncg.edu
Sun Nov 17 13:23:42 PST 2013


On Sunday, 17 November 2013 at 16:57:19 UTC, Andrei Alexandrescu 
wrote:
> Then, people may want to sort records by e.g. Lastname, 
> Firstname, or index a text by words. For names we'd need some 
> census data or phonebook.

Sure.
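For what it's worth, sorting records by Lastname, then Firstname is easy to set up with std.algorithm.multiSort; a minimal sketch with a made-up record type:

---
import std.algorithm;

struct Person {
     string lastname, firstname;
}

void sortByName(Person[] people) {
     // Sort by lastname; ties are broken by firstname.
     people.multiSort!("a.lastname < b.lastname",
                       "a.firstname < b.firstname")();
}
---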

As a source of data, I used app_c.csv from 
http://www.census.gov/genealogy/www/data/2000surnames/ (File B: 
Surnames).
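
For reference, the file starts roughly like this (column layout from memory, so double-check against your download; the code below only relies on the name being column 0 and the count being column 2):

---
name,rank,count,prop100k,cum_prop100k,...
SMITH,1,2376206,880.85,880.85,...
JOHNSON,2,1857160,688.44,1569.30,...
---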


Test case:
---
import std.stdio, std.algorithm, std.random, std.range, std.conv, 
std.datetime;

// Name generator constants
enum NumberOfPossibleNames = 1_000_000;
enum NumberOfRandomNames = 1_000_000;

enum SortStrategy = SwapStrategy.unstable;

void main() {
     auto data = File("app_c.csv").byLine
             .drop(1) // First line is just column names
             .take(NumberOfPossibleNames)
             .map!(e => e.text.split(","))
             .array;
     auto names = data.map!(e => e[0]).array;
     auto proportions = data.map!(e => e[2].to!size_t).array;

     auto rnd = Random(50);
     auto randomNames = fastDice(rnd, proportions)
             .take(NumberOfRandomNames)
             .map!(i => names[i])
             .array;

     StopWatch sw = AutoStart.yes;
     randomNames.sort!("a < b", SortStrategy)();
     sw.stop();

     writeln(randomNames.length, " names sorted in ",
         sw.peek.msecs, " msecs using ", SortStrategy, " sort");
}

struct FastDice(Rng, SearchPolicy pol = SearchPolicy.gallop) {
     SortedRange!(size_t[]) _propCumSumSorted;
     size_t _sum;
     size_t _front;
     Rng* _rng;

     this(ref Rng rng, size_t[] proportions) {
         // Private copy, turned into a cumulative sum in place.
         auto cumSum = proportions.dup;
         _rng = &rng;

         size_t mass = 0;
         foreach (ref e; cumSum) {
             mass += e;
             e = mass;
         }

         _sum = cumSum.back;

         _propCumSumSorted = assumeSorted(cumSum);

         popFront();
     }

     void popFront() {
         immutable point = uniform(0, _sum, *_rng);
         assert(point < _sum);

         // Index of the first cumulative sum greater than point, i.e. the
         // count of sums <= point. Searching for point + 1 avoids an
         // off-by-one when point lands exactly on a boundary.
         _front = _propCumSumSorted.lowerBound!pol(point + 1).length;
     }

     enum empty = false;

     @property
     auto front() {
         return _front;
     }
}

auto fastDice(Rng)(ref Rng rng, size_t[] proportions) {
     return FastDice!(Rng)(rng, proportions);
}

auto fastDice(size_t[] proportions) {
     return fastDice(rndGen, proportions);
}
---

Results (using `-O -inline -release -noboundscheck` on my 
computer):
unstable sort: 738 msecs
stable sort: 1001 msecs
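
(For reference, the full build/run line would be something like the following; the file name is just a placeholder.)

---
dmd -O -inline -release -noboundscheck bench.d
./bench
---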


Also, to do this in reasonable time I had to create a new range, 
which I called "FastDice". It does the same job as 
std.random.dice, but it's intended for cases where you'll be 
making many throws over the same data, so it does a bit of 
precomputation (building a cumulative-sum array) and uses a 
binary/galloping search to cut the time per throw. I opted for 
gallop here since the data is sorted so that the most common 
names come first.
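
A minimal usage sketch, assuming FastDice from the listing above is in scope (the proportions are made up):

---
import std.random, std.range, std.stdio;

void example() {
     auto rnd = Random(42);
     size_t[] props = [70, 20, 10];

     // std.random.dice rescans (and re-sums) the proportions on
     // every call.
     auto one = dice(rnd, 70, 20, 10);

     // FastDice pays the cumulative-sum cost once; each subsequent
     // throw is a search over the precomputed array.
     auto five = fastDice(rnd, props).take(5).array;

     writeln(one, " ", five);
}
---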

FastDice needs a bit of work to make it properly generic, but 
it's a good start, and I'll just say it's WTFPL code, so use it 
for whatever. It should be especially handy for generating test 
cases like the one above.

FWIW, FastDice takes ~400 ms to generate all those dice throws. I 
did a back-of-the-envelope calculation on using std.random.dice 
instead: just caching the sum saves maybe 30 minutes per run (no, 
I didn't wait that long; I stopped after 5 minutes and wrote 
FastDice). Using a gallop search instead of a linear scan is a 
significant win on top of that (the difference between O(n) and 
O(log n) per throw).
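
To put rough numbers on that (assuming the surname file has on the order of 150,000 rows): std.random.dice re-sums the proportions on every call, so 1,000,000 throws cost roughly 1,000,000 x 150,000 = 1.5e11 element visits before any searching even happens; the cumulative-sum array replaces all of that with a single pass. Each throw then drops from a linear scan to about log2(150,000) ~ 17 probes with a plain binary search, and typically fewer with gallop, since the common names sit at the front.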

