Worst-case performance of quickSort / getPivot

Sun Nov 17 08:57:19 PST 2013

On 11/17/13 6:20 AM, Chris Cain wrote:
> I'm not suggesting my benchmark be the only one; if we're going to use
> pseudorandom data (I'm not certain we could actually get "realistic
> data" that would serve us that much better) we might as well have
> different test cases that stress the sort routine in different ways.
> Obviously, using the real genome would be preferable to generating some
> (since it's actually truly "real" data, just used in an unorthodox way)
> but there's a disadvantage to attaching a 4.6MB file to a benchmarking
> setup. Especially if more might come.

OK, since I see you have some interest...

You said nobody would care to actually sort genome data. I'm aiming for 
data that's likely to be a good proxy for tasks people are interested in 
doing.

For example, people may be interested in sorting floating-point numbers 
resulting from sales, measurements, frequencies, probabilities, and 
whatnot. Since most of those have a Gaussian distribution, a corpus with 
Gaussian-distributed measurements would be nice.

Then, people may want to sort things by date/time. Depending on the 
scale the distribution is different - diurnal cycle, weekly cycle, 
seasonal cycle, secular ebbs and flows etc. I'm unclear on what would be 
a good set of data. For sub-day time ranges uniform distribution may be 
appropriate.

Then, people may want to sort records by e.g. Lastname, Firstname, or 
index a text by words. For names we'd need some census data or 
phonebook. For general text sorting we can use classic texts such as 
Alice in Wonderland or the King James Bible (see 
http://corpus.canterbury.ac.nz/descriptions/). Sorting by word length is 
a possibility (word lengths are probably Gaussian-distributed).

Uniform random data is also a baseline, not terribly representative, but 
worth keeping an eye on. In fact uniform data is unfairly rigged in 
quicksort's favor: any pivot is likely to be pretty good, and there are 
no sorted runs that often occur in real data.

Andrei