Worst-case performance of quickSort / getPivot
Andrei Alexandrescu
SeeWebsiteForEmail at erdani.org
Sun Nov 17 08:57:19 PST 2013
On 11/17/13 6:20 AM, Chris Cain wrote:
> I'm not suggesting my benchmark be the only one; if we're going to use
> pseudorandom data (I'm not certain we could actually get "realistic
> data" that would serve us that much better) we might as well have
> different test cases that stress the sort routine in different ways.
> Obviously, using the real genome would be preferable to generating some
> (since it's actually truly "real" data, just used in an unorthodox way)
> but there's a disadvantage to attaching a 4.6MB file to a benchmarking
> setup. Especially if more might come.
OK, since I see you have some interest...
You said nobody would care to actually sort genome data. I'm aiming for
data that's likely to be a good proxy for tasks people are interested in
doing.
For example, people may be interested in sorting floating-point numbers
resulting from sales, measurements, frequencies, probabilities, and
whatnot. Since most of those have a Gaussian distribution, a corpus with
Gaussian-distributed measurements would be nice.
Then, people may want to sort things by date/time. Depending on the
scale the distribution is different - diurnal cycle, weekly cycle,
seasonal cycle, secular ebbs and flows etc. I'm unclear on what would be
a good set of data. For sub-day time ranges uniform distribution may be
appropriate.
Then, people may want to sort records by e.g. Lastname, Firstname, or
index a text by words. For names we'd need some census data or
phonebook. For general text sorting we can use classic texts such as
Alice in Wonderland or the King James Bible (see
http://corpus.canterbury.ac.nz/descriptions/). Sorting by word length is
a possibility (word lengths are probably Gaussian-distributed).
Uniform random data is also a baseline, not terribly representative, but
worth keeping an eye on. In fact uniform data is unfairly rigged in
quicksort's favor: any pivot is likely to be pretty good, and there are
no sorted runs that often occur in real data.
Andrei
More information about the Digitalmars-d
mailing list