Worst-case performance of quickSort / getPivot

Sun Nov 17 06:20:34 PST 2013

On Sunday, 17 November 2013 at 07:19:26 UTC, Andrei Alexandrescu 
wrote:
> On 11/16/13 9:21 PM, Chris Cain wrote:
>> That said, it might also be reproduced "well enough" using a 
>> random
>> generator to create similar strings to sort, but the basic 
>> idea is
>> there. I just like using real genomes for performance testing 
>> things :)
>
> I am hoping for some more representative corpora, along the 
> lines of http://sortbenchmark.org/. Some data that we can use 
> as good proxies for typical application usage.
>
> Andrei

I think I get what you're saying, but sortbenchmark.org uses 
completely pseudorandom (but reproducable) entries that I don't 
think are representative of real data either:

(using gensort -a minus the verification columns)
---
AsfAGHM5om
~sHd0jDv6X
uI^EYm8s=|
Q)JN)R9z-L
o4FoBkqERn
*}-Wz1;TD-
0fssx}~[oB
...
---

Most places use very fake data as proxies for real data. It's 
better to have something somewhat structured and choose data 
that, despite not being real data, stresses the benchmark in a 
unique way.

I'm not suggesting my benchmark be the only one; if we're going 
to use pseudorandom data (I'm not certain we could actually get 
"realistic data" that would serve us that much better) we might 
as well have different test cases that stress the sort routine in 
different ways. Obviously, using the real genome would be 
preferable to generating some (since it's actually truly "real" 
data, just used in an unorthodox way) but there's a disadvantage 
to attaching a 4.6MB file to a benchmarking setup. Especially if 
more might come.

Anyway, it's a reasonable representation of "data that has no 
discernable order that can occasionally take some time to 
compare." Think something like sorting a list of customer records 
by name. If they're ordered by ID, then the names would not 
likely have a discernable pattern and the comparison between 
names might be "more expensive" because some names can be common.

We can do "more realistic" for that type of scenario, if you'd 
like. I could look at a distribution for last names/first names 
and generate fake names to sort in a reasonable approximation of 
a distribution of real names. I'm not certain the outcome would 
change that much.