topN using a heap
Ivan Kazmenko via Digitalmars-d
digitalmars-d at puremagic.com
Sun Jan 17 12:32:28 PST 2016
On Sunday, 17 January 2016 at 16:06:31 UTC, Andrei Alexandrescu
wrote:
> On 01/17/2016 06:41 AM, Ivan Kazmenko wrote:
>> The average case is O(n + (k log n log k)) for small enough k.
>> So, any k
>> below roughly n / log^2 (n) makes the second summand less than
>> the first.
>
> I don't understand how you derived the average case complexity,
> and I don't understand how you derived the bound for k.
>
> Regarding the complexity: for all k (small or large), it takes
> O(k) to build a heap. Then in the worst case we do (n - k) heap
> insertions, i.e. total complexity is O(k + (n - k) * log(k)).
> Is that correct?
Yeah, I have the same notion.
> If k is any fraction of n, the worst case complexity comes to
> O(n + n log n) i.e. O(n log n). By your calculation, the
> average complexity comes to O(n log^2 n), i.e. greater than the
> worst case. So there's a mistake somewhere.
The upper bound O(n + k log k log n) is correct, but it's tight
only for small k. I'll explain below.
> Could you please clarify matters? Thanks!
Here is a more verbose version.
Suppose for simplicity that the array consists real numbers of
the same continuous random distribution.
(A discrete random distribution would complicate the analysis a
bit, since the probability of two elements being equal would
become greater than zero.)
Then, element a_i is the minimum of a_1, a_2, ..., a_i with
probability 1/i.
Moreover, it is the second minimum with probability 1/i, the
third minimum with probability 1/i, ..., the maximum of a_1, a_2,
..., a_i with the same probability 1/i.
To prove this, we can show that, if we look at the ordering
permutation p_1, p_2, ..., p_i such that of a_{p_1} < a_{p_2} <
... a_{p_i}, each of the i! possible permutations is equally
probable.
What immediately follows is that when we track k smallest
elements and consider element i>k, the probability that it will
change the k smallest elements encountered so far is k/i.
Summing that for i from k+1 to n, we get the expected number of
heap insertions:
(k/(k+1)) + (k/(k+2)) + ... + k/n.
Now, this is just k multiplied by:
(1/1 + 1/2 + 1/3 + ... + 1/n) *minus*
(1/1 + 1/2 + 1/3 + ... + 1/k).
As these are harmonic series, the first line is asymptotically
equal to log n, and the second line asymptotically equal to log k.
To summarize, the expected number of heap insertions for an array
of random reals is k (log n - log k), and the complexity is O(n +
k log k (log n - log k)).
What I did here for simplicity is just omit the "minus logarithm
of k" part to get an upper bound.
For small enough k, the bound is tight.
For example, when k=sqrt(n), log n - log k = log(n/k) = (log n) /
2.
On the other hand, when k=Theta(n), log n - log k = log(n/k) =
Theta(1).
Ivan Kazmenko.
More information about the Digitalmars-d
mailing list