Worst-case performance of quickSort / getPivot
Chris Cain
clcain at uncg.edu
Sun Nov 17 13:23:42 PST 2013
On Sunday, 17 November 2013 at 16:57:19 UTC, Andrei Alexandrescu
wrote:
> Then, people may want to sort records by e.g. Lastname,
> Firstname, or index a text by words. For names we'd need some
> census data or phonebook.
Sure.
As a source of data, I used app_c.csv from
http://www.census.gov/genealogy/www/data/2000surnames/ -> File B:
Surnames...
Test case:
---
import std.stdio, std.algorithm, std.array, std.random, std.range,
       std.conv, std.datetime;
// Name generator constants
enum NumberOfPossibleNames = 1_000_000;
enum NumberOfRandomNames = 1_000_000;
enum SortStrategy = SwapStrategy.unstable;

void main() {
    auto data = File("app_c.csv").byLine
        .drop(1) // First line is just column names
        .take(NumberOfPossibleNames)
        .map!(e => e.text.split(","))
        .array;
    auto names = data.map!(e => e[0]).array;
    auto proportions = data.map!(e => e[2].to!size_t).array;

    auto rnd = Random(50);
    auto randomNames = fastDice(rnd, proportions)
        .take(NumberOfRandomNames)
        .map!(i => names[i])
        .array;

    StopWatch sw = AutoStart.yes;
    randomNames.sort!("a < b", SortStrategy)();
    sw.stop();

    writeln(randomNames.length, " names sorted in ",
            sw.peek.msecs, " msecs using ", SortStrategy, " sort");
}
struct FastDice(Rng, SearchPolicy pol = SearchPolicy.gallop) {
    SortedRange!(size_t[]) _propCumSumSorted;
    size_t _sum;
    size_t _front;
    Rng* _rng;

    this(ref Rng rng, size_t[] proportions) {
        size_t[] _propCumSum = proportions.save.array;
        _rng = &rng;
        // Precompute the cumulative sum of the proportions once,
        // so each throw only needs a search over it.
        size_t mass = 0;
        foreach (ref e; _propCumSum) {
            mass += e;
            e = mass;
        }
        _sum = _propCumSum.back;
        _propCumSumSorted = assumeSorted(_propCumSum);
        popFront();
    }
    void popFront() {
        immutable point = uniform(0, _sum, *_rng);
        assert(point < _sum);
        // The chosen index is the number of cumulative sums <= point;
        // searching for point + 1 counts a point that lands exactly on
        // a boundary toward the next entry, matching std.random.dice.
        _front = _propCumSumSorted.lowerBound!pol(point + 1).length;
    }
    enum empty = false;

    @property
    auto front() {
        return _front;
    }
}

auto fastDice(Rng)(ref Rng rng, size_t[] proportions) {
    return FastDice!(Rng)(rng, proportions);
}

auto fastDice(size_t[] proportions) {
    return fastDice(rndGen, proportions);
}
---
Results (using `-O -inline -release -noboundscheck` on my
computer):
unstable sort: 738 msecs
stable sort: 1001 msecs
Also, to do this (in reasonable time) I had to create a new range
which I called "FastDice" ... it does the same thing as
std.random.dice, but is intended for cases where you'll be making
many throws over the same proportions, so it does a bit of
precomputation (building a cumulative-sum array) and uses a
binary or galloping search to reduce the time to come up with
each throw. I opted for gallop here since the data is sorted in
such a way that the most common names come first, so most
searches end after only a few probes.
It needs a bit of work to make it actually generic, but it's a
good start, and I'll just say it's WTFPL code, so use it for
whatever. It'll be especially good for generating test cases like
the one above.
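For instance, here's a minimal sketch of using it on its own
(assuming the FastDice code above is in the same module; the
weights are made up for illustration):

---
import std.algorithm, std.array, std.random, std.range, std.stdio;

void main() {
    auto rnd = Random(42);
    // Index 0 should come up ~70% of the time, index 1 ~20%, etc.
    auto throws = fastDice(rnd, [70, 20, 10]).take(1_000).array;
    writeln("zeros: ", throws.count(0), " out of ", throws.length);
}
---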
FWIW, FastDice takes ~400 msecs to generate all those dice
throws. I did a back-of-the-envelope calculation on using
std.random.dice instead, and just caching the sum (which dice
recomputes on every throw) saves maybe 30 minutes per run (no, I
didn't wait that long; I stopped after 5 minutes and wrote
FastDice). The time saved by using a gallop search instead of a
linear search is significant as well (it's the difference
between O(n) and O(log n) per throw).
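To put rough numbers on that (a back-of-the-envelope sketch; the
table size n below is an assumption, not the exact number of
surnames in the file):

---
import std.math, std.stdio;

void main() {
    enum n = 150_000;     // assumed cumulative-sum entries (one per name)
    enum throws = 1_000_000;
    // Linear scan: ~n/2 comparisons per throw on average.
    writefln("linear: ~%s comparisons", cast(ulong) throws * (n / 2));
    // Binary search: ~log2(n) comparisons per throw; gallop is
    // usually fewer here because common names sit near the front.
    writefln("binary: ~%s comparisons",
             cast(ulong) (throws * log2(cast(real) n)));
}
---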