topN using a heap

Sun Jan 17 12:32:28 PST 2016

On Sunday, 17 January 2016 at 16:06:31 UTC, Andrei Alexandrescu 
wrote:
> On 01/17/2016 06:41 AM, Ivan Kazmenko wrote:
>> The average case is O(n + (k log n log k)) for small enough k. 
>> So, any k
>> below roughly n / log^2 (n) makes the second summand less than 
>> the first.
>
> I don't understand how you derived the average case complexity, 
> and I don't understand how you derived the bound for k.
>
> Regarding the complexity: for all k (small or large), it takes 
> O(k) to build a heap. Then in the worst case we do (n - k) heap 
> insertions, i.e. total complexity is O(k + (n - k) * log(k)). 
> Is that correct?

Yeah, I have the same notion.

> If k is any fraction of n, the worst case complexity comes to 
> O(n + n log n) i.e. O(n log n). By your calculation, the 
> average complexity comes to O(n log^2 n), i.e. greater than the 
> worst case. So there's a mistake somewhere.

The upper bound O(n + k log k log n) is correct, but it's tight 
only for small k.  I'll explain below.

> Could you please clarify matters? Thanks!

Here is a more verbose version.

Suppose for simplicity that the array consists real numbers of 
the same continuous random distribution.
(A discrete random distribution would complicate the analysis a 
bit, since the probability of two elements being equal would 
become greater than zero.)
Then, element a_i is the minimum of a_1, a_2, ..., a_i with 
probability 1/i.
Moreover, it is the second minimum with probability 1/i, the 
third minimum with probability 1/i, ..., the maximum of a_1, a_2, 
..., a_i with the same probability 1/i.
To prove this, we can show that, if we look at the ordering 
permutation p_1, p_2, ..., p_i such that of a_{p_1} < a_{p_2} < 
... a_{p_i}, each of the i! possible permutations is equally 
probable.

What immediately follows is that when we track k smallest 
elements and consider element i>k, the probability that it will 
change the k smallest elements encountered so far is k/i.
Summing that for i from k+1 to n, we get the expected number of 
heap insertions:
(k/(k+1)) + (k/(k+2)) + ... + k/n.

Now, this is just k multiplied by:
(1/1 + 1/2 + 1/3 + ... + 1/n) *minus*
(1/1 + 1/2 + 1/3 + ... + 1/k).
As these are harmonic series, the first line is asymptotically 
equal to log n, and the second line asymptotically equal to log k.
To summarize, the expected number of heap insertions for an array 
of random reals is k (log n - log k), and the complexity is O(n + 
k log k (log n - log k)).

What I did here for simplicity is just omit the "minus logarithm 
of k" part to get an upper bound.
For small enough k, the bound is tight.
For example, when k=sqrt(n), log n - log k = log(n/k) = (log n) / 
2.
On the other hand, when k=Theta(n), log n - log k = log(n/k) = 
Theta(1).

Ivan Kazmenko.