Worst-case performance of quickSort / getPivot
Chris Cain
clcain at uncg.edu
Sun Nov 17 06:20:34 PST 2013
On Sunday, 17 November 2013 at 07:19:26 UTC, Andrei Alexandrescu
wrote:
> On 11/16/13 9:21 PM, Chris Cain wrote:
>> That said, it might also be reproduced "well enough" using a
>> random
>> generator to create similar strings to sort, but the basic
>> idea is
>> there. I just like using real genomes for performance testing
>> things :)
>
> I am hoping for some more representative corpora, along the
> lines of http://sortbenchmark.org/. Some data that we can use
> as good proxies for typical application usage.
>
> Andrei
I think I get what you're saying, but sortbenchmark.org uses
completely pseudorandom (but reproducable) entries that I don't
think are representative of real data either:
(using gensort -a minus the verification columns)
---
AsfAGHM5om
~sHd0jDv6X
uI^EYm8s=|
Q)JN)R9z-L
o4FoBkqERn
*}-Wz1;TD-
0fssx}~[oB
...
---
Most places use very fake data as proxies for real data. It's
better to have something somewhat structured and choose data
that, despite not being real data, stresses the benchmark in a
unique way.
I'm not suggesting my benchmark be the only one; if we're going
to use pseudorandom data (I'm not certain we could actually get
"realistic data" that would serve us that much better) we might
as well have different test cases that stress the sort routine in
different ways. Obviously, using the real genome would be
preferable to generating some (since it's actually truly "real"
data, just used in an unorthodox way) but there's a disadvantage
to attaching a 4.6MB file to a benchmarking setup. Especially if
more might come.
Anyway, it's a reasonable representation of "data that has no
discernable order that can occasionally take some time to
compare." Think something like sorting a list of customer records
by name. If they're ordered by ID, then the names would not
likely have a discernable pattern and the comparison between
names might be "more expensive" because some names can be common.
We can do "more realistic" for that type of scenario, if you'd
like. I could look at a distribution for last names/first names
and generate fake names to sort in a reasonable approximation of
a distribution of real names. I'm not certain the outcome would
change that much.
More information about the Digitalmars-d
mailing list