An interesting data structure with search time O(sqrt n)

Timon Gehr via Digitalmars-d digitalmars-d at puremagic.com
Wed Dec 2 03:26:11 PST 2015


On 12/01/2015 11:48 PM, Andrei Alexandrescu wrote:
> On 12/01/2015 12:13 AM, Andrei Alexandrescu wrote:
>> On 11/30/15 9:47 PM, Timon Gehr wrote:
>>> On 12/01/2015 03:33 AM, Timon Gehr wrote:
>>>> On 11/30/2015 09:57 PM, Andrei Alexandrescu wrote:
>>>>> So now consider my square heaps. We have O(n) build time (just a bunch
>>>>> of heapifications) and O(sqrt n) search.
>>>>
>>>> How do you build in O(n)? (The initial array is assumed to be
>>>> completely
>>>> unordered, afaict.)
>>>
>>> (I meant to say: There aren't any assumptions on the initial ordering of
>>> the array elements.)
>>
>> That's quite challenging. (My O(n) estimate was off the cuff and
>> possibly wrong.) Creating the structure entails simultaneously solving
>> the selection problem (find the k smallest elements) for several values
>> of k. I'll post here if I come up with something. -- Andrei
>
> OK, I think I have an answer to this in the form of an efficient algorithm.
>
> First off: sizes 1+3+5+7+... seem a great choice, I'll use that for the
> initial implementation (thanks Titus!).
>
> Second: the whole max heap is a red herring - min heap is just as good,
> and in fact better. When doing the search, just overshoot by one, then
> go back one heap to the left and do the final linear search there.
>
> So the structure we're looking at is an array of adjacent min-heaps of
> sizes 1, 3, 5, etc. The heaps are ordered (the maximum of heap k is less
> than or equal to the minimum of heap k+1). Question is how do we build
> such an array of heaps in place starting from an unstructured array of
> size n.
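
(A sketch of that layout and of the overshoot search, in Python. The
indexing is my own inference from the description: heap k occupies the
slice [k², (k+1)²), so the sorted heap heads sit at indices 0, 1, 4, 9, ….)

```python
def search(a, x):
    """Membership query on the array-of-min-heaps layout.

    Walk the heap heads a[k*k] (sorted, since adjacent heaps are
    ordered) until one overshoots x, then go back one heap and
    linear-search it -- both phases touch O(sqrt n) elements.
    """
    n = len(a)
    k = 0
    while k * k < n and a[k * k] <= x:
        k += 1                      # overshoot by one heap
    if k == 0:
        return False                # x is below the global minimum
    lo, hi = (k - 1) ** 2, min(k * k, n)
    return x in a[lo:hi]            # final linear search in heap k-1
```

On a sorted array (one valid instance of the structure), searching for 7
in range(10) scans the heads 0, 1, 4, 9, overshoots at 9, and falls back
to the heap spanning indices 4 through 8.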
>
> One simple approach is to just sort the array in O(n log n). This
> satisfies all properties - all adjacent subsequences are obviously
> ordered, and any subsequence has the min heap property. As an
> engineering approach we may as well stop here - sorting is a widely
> studied and well-implemented algorithm. However, we hope to get away
> with less work because we don't quite need full sorting.
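
(As a sanity check that sorting "satisfies all properties": a
hypothetical checker of the two invariants, my own helper rather than
anything from the thread, makes this concrete. Any sorted array passes.)

```python
def is_valid(a):
    """Check the structure's two invariants: each chunk
    [k*k, (k+1)**2) is a min-heap, and the maximum of heap k is
    <= the minimum of heap k+1."""
    n = len(a)
    prev_max = None
    k = 0
    while k * k < n:
        heap = a[k * k : min((k + 1) ** 2, n)]
        for t in range(len(heap)):          # local min-heap property
            for c in (2 * t + 1, 2 * t + 2):
                if c < len(heap) and heap[t] > heap[c]:
                    return False
        if prev_max is not None and prev_max > min(heap):
            return False                    # adjacent heaps out of order
        prev_max = max(heap)
        k += 1
    return True
```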
>
> Here's the intuition: the collection of heaps can be seen as one large
> heap that has a DAG structure (as opposed to a tree). In the DAG, the
> root of heap k+1 is the child of all leaves of heap k (see
> http://imgur.com/I366GYS which shows the DAG for the 1, 3, 5, and 7 heaps).
>
> Clearly getting this structure to respect the heap property is all
> that's needed for everything to work - so we simply apply the classic
> heapify algorithm to it. It seems it can be applied almost unchanged -
> starting from the end, sift each element down the DAG.
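
(A sketch of that DAG sift-down in Python, my own reconstruction of the
idea: dag_children encodes the edges, with heap k at indices
[k², (k+1)²) and each leaf of heap k parenting the root of heap k+1.)

```python
import math

def dag_children(i, n):
    """Children of node i in the DAG of adjacent min-heaps: the
    usual 2t+1, 2t+2 within its own heap; a leaf's single child
    is the root of the next heap (if any)."""
    k = math.isqrt(i)                  # heap k spans [k*k, (k+1)**2)
    base = k * k
    size = min(2 * k + 1, n - base)    # the last heap may be partial
    t = i - base
    kids = [base + c for c in (2 * t + 1, 2 * t + 2) if c < size]
    if not kids and (k + 1) ** 2 < n:  # leaf: child is next heap's root
        kids.append((k + 1) ** 2)
    return kids

def dag_heapify(a):
    """Classic bottom-up heapify applied to the DAG: starting from
    the end, sift each element down along the DAG edges."""
    n = len(a)
    for i in reversed(range(n)):
        j = i
        while True:
            kids = dag_children(j, n)
            if not kids:
                break
            m = min(kids, key=lambda c: a[c])
            if a[m] >= a[j]:
                break
            a[j], a[m] = a[m], a[j]    # move the smaller value up
            j = m
```

Afterwards every DAG edge satisfies parent ≤ child, which also orders
adjacent heaps: the maximum of a min-heap sits at a leaf, and every leaf
is a parent of the next heap's root.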
>
> This looks efficient and minimal; I doubt there's any redundant work.
> However, getting bounds for complexity of this will be tricky. Classic
> heapify is tricky, too - it seems to have complexity O(n log n) but in
> fact has complexity O(n) - see nice discussion at
> http://stackoverflow.com/questions/9755721/how-can-building-a-heap-be-on-time-complexity.
> When applying heapify to the DAG, there are more restrictions and the
> paths are longer, so a sliver more than O(n) is expected.
>
> Anyway, this looks ready for a preliminary implementation and some more
> serious calculations.
>

Assume the initial array is sorted from largest to smallest. This will 
be the worst case for this construction algorithm, as each element will 
be sifted all the way down to the leaf level of the last heap.

For simplicity, let's assume that n = ((m+1)/2)² for m odd, such that 
the sizes of the heaps are 1,3,…,m (so m = Θ(√n)). Sifting an element 
down one entire heap of size i takes at least ⌊log₂(i)⌋ steps.

Ignoring the work performed in the heaps each element starts in (which 
is obviously at most O(n) in total), we can lower-bound the work 
performed. For the heap of size j, we sift j elements down all the 
following heaps of sizes i=j+1,i=j+2,…,i=m (only every other size 
actually occurs, so this overstates the step count by at most a factor 
of two, which does not affect the Ω bound):

∑ⱼ[j∈{1,3,…,m}]·j·∑ᵢ[j+1≤i≤m]·log(i)
=
∑ᵢ∑ⱼ[j∈{1,3,…,m}]·[j+1≤i≤m]·j·log(i)
=
∑ᵢ[2≤i≤m]·log(i)·∑ⱼ[j∈{1,3,…,m}]·[j+1≤i] j
=
∑ᵢ[2≤i≤m]·log(i)·∑ⱼ∑ₖ[j=2·k+1]·[1≤j≤m]·[j≤i-1] j
=
∑ᵢ[2≤i≤m]·log(i)·∑ⱼ∑ₖ[j=2·k+1]·[1≤2·k+1≤i-1]·(2·k+1)
=
∑ᵢ[2≤i≤m]·log(i)·∑ₖ[1≤2·k+1≤i-1]·(2·k+1)
=
∑ᵢ[2≤i≤m]·log(i)·∑ₖ[0≤k≤⌊i/2⌋-1]·(2·k+1)
=
∑ᵢ[2≤i≤m]·log(i)·⌊i/2⌋²
≥
Ω(m³)
=
Ω(n^(3/2)).

I.e., sorting is asymptotically faster, and probably also faster in 
practice. Moreover, the construction algorithm would be slower even if 
sifting down one entire heap took only constant time: the j elements of 
heap j would still each cross the Θ(m-j) heaps that follow it, for 
∑ⱼ j·(m-j) = Ω(m³) = Ω(n^(3/2)) steps in total.

However, for your adapted data structure with min-heaps, implementing 
insert and remove in O(√n log n) is now easy using your "sifting in a 
DAG" approach.

One way to prove a useful lower bound on the time a build must take 
might be to observe that building recursively can be used for sorting: 
build the structure, then recurse into each heap; once all blocks have 
size 1, the array is sorted.

> One more interesting thing: the heap heads are sorted, so when
> searching, the heap corresponding to the searched item can be found
> using binary search. That makes that part of the search essentially
> negligible - the lion's share will be the linear search on the last
> mile. In turn, that suggests that more, smaller heaps would be a better
> choice. (At an extreme, if we have an array of heaps each
> proportional to log(n), then we get search time O(log n) even though the
> array is not entirely sorted.)
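
(That binary search over the heads might look like this, again with heap
k's head at index k²; locate_heap is a name of my own invention.)

```python
import math

def locate_heap(a, x):
    """Binary-search the sorted heap heads a[0], a[1], a[4], ...
    for the last heap whose head (= its minimum) is <= x.
    Returns that heap's index k, or -1 if x precedes everything."""
    n = len(a)
    if n == 0:
        return -1
    nheaps = math.isqrt(n - 1) + 1     # heap k exists iff k*k < n
    lo, hi, ans = 0, nheaps - 1, -1
    while lo <= hi:
        mid = (lo + hi) // 2
        if a[mid * mid] <= x:
            ans, lo = mid, mid + 1     # head fits; look further right
        else:
            hi = mid - 1
    return ans
```

Only O(log √n) = O(log n) head probes are spent here, so the O(√n)
linear scan inside the located heap indeed dominates.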
>
>
> Andrei
>

The linear search can have much better locality, so the binary search 
may actually dominate the cost for some data sets.

