An interesting data structure with search time O(sqrt n)

Chris Wright via Digitalmars-d digitalmars-d at puremagic.com
Wed Dec 2 10:27:06 PST 2015


On Wed, 02 Dec 2015 05:58:24 +0000, Navin wrote:

> Nice to see this interesting post and to learn from it.
> 
>   I have a few questions.
> 
> 1) Is this an offline data structure, since you don't know what the
> future elements are going to look like? That is, later elements (from n
> to 2n) can break or reshuffle your heaps in the worst case. Or is it a
> dynamic data structure?

You can usually change an offline data structure into an amortized online 
one. (I think.) You can usually change an amortized online algorithm into 
a deamortized one with a series of ugly hacks and large increases in 
constants.

I think you can make insertions work and not be too terrible with an 
amortized algorithm. Something like:

To insert a value X into the square heap, find the largest heap with a 
head equal to X. If there is none, find the smallest heap with a head 
greater than X. If there is none, choose the final heap. If that heap is 
not full, insert X into it. If it is full, trigger a rebalance.
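
A rough sketch of that lookup (Python just for illustration; keeping the 
heap heads in a sorted side array, and all the names below, are my 
assumptions rather than anything from the original structure):

    import bisect

    def find_insertion_heap(heads, x):
        # heads[i] is the head (maximum) of heap i, kept non-decreasing.
        # Mirror the prose: prefer the rightmost (largest) heap whose
        # head equals x, else the first heap whose head exceeds x, else
        # fall back to the final heap.
        hi = bisect.bisect_right(heads, x)
        if hi > 0 and heads[hi - 1] == x:
            return hi - 1
        return hi if hi < len(heads) else len(heads) - 1

Whether the chosen heap is full -- and hence whether a rebalance fires -- 
is a separate check.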

'Rebalance' is an operation we use to ensure that each intermediate heap 
is exactly half full (rounding down) and the final heap is no more than 
half full. We start at a given heap. We calculate its surplus -- the 
number of elements it has in excess of half its capacity. We scan to the 
right (increasing heap sizes) until we find a heap with at least enough 
spare capacity to hold the surpluses of all the prior heaps we've 
scanned. If we reach the end of the square, we create a new heap. This is 
the target heap.

Now we repeatedly bubble up values from lower heaps to higher ones until 
no heap between the one we tried to insert into and the target heap has 
any surplus.
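
The scan for the target heap might look like this (same caveats: Python 
for illustration only, and the sizes/caps parallel arrays are assumed 
bookkeeping, not something the structure prescribes):

    def find_rebalance_target(sizes, caps, start):
        # Walk right from the heap we tried to insert into, accumulating
        # the surplus (elements beyond half capacity) of each heap we
        # pass, until some heap has spare room for all of it. Returning
        # len(sizes) means "append a new heap to the square".
        surplus = 0
        for i in range(start, len(sizes)):
            if i > start and caps[i] - sizes[i] >= surplus:
                return i, surplus
            surplus += max(0, sizes[i] - caps[i] // 2)
        return len(sizes), surplus

The actual bubbling of values from heap to heap is left out here; that's 
where the real cost lives.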

This costs us in the worst case something like O(n * log(n) * sqrt(n)). I 
haven't done the full analysis. But after doing a large rebalance, you 
shouldn't need to do a large rebalance for quite a while, so the average 
case, even on adversarial data, shouldn't be too horrible.

Deletions are simple: find the element and remove it from its heap. This 
can, however, reduce a heap to less than half filled. (This means I can 
insert and delete elements in an adversarial pattern to force all 
elements into one heap.) A similar rebalance algorithm is needed, but 
it's harder on this side because we're using max heaps everywhere and 
need to fill lower heaps.
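
Removing an arbitrary element from an array-backed max heap is just the 
textbook routine -- swap the last slot into the hole, then restore the 
heap property -- something like (names mine):

    def remove_from_max_heap(h, x):
        i = h.index(x)            # linear scan; the heap is small anyway
        last = h.pop()
        if i == len(h):           # x was the last slot; nothing to fix
            return
        h[i] = last
        while i > 0 and h[i] > h[(i - 1) // 2]:   # sift up if needed
            p = (i - 1) // 2
            h[i], h[p] = h[p], h[i]
            i = p
        n = len(h)
        while True:                               # otherwise sift down
            l, r, big = 2 * i + 1, 2 * i + 2, i
            if l < n and h[l] > h[big]:
                big = l
            if r < n and h[r] > h[big]:
                big = r
            if big == i:
                break
            h[i], h[big] = h[big], h[i]
            i = big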

> 2) Searching in min or max heaps is bad, isn't it?

Linear time in the worst case. If you're looking for something in a max 
heap that happens to be the smallest element, it can be in any leaf node, 
which gives you half the tree to look through. And the average isn't much 
better.

However, if we can guarantee that an element is in one heap inside this 
square heap, it costs us O(sqrt n) to search that heap since that heap 
has no more than sqrt(n) elements.
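
So a lookup is roughly: binary search the heads for the one candidate 
heap, then scan it. Under the same assumed layout as the earlier 
sketches:

    import bisect

    def square_heap_search(heads, heaps, x):
        # The first heap whose head (its maximum) is >= x is the only
        # candidate, and it holds at most about sqrt(n) elements.
        i = bisect.bisect_left(heads, x)
        if i < len(heads) and x in heaps[i]:
            return i, heaps[i].index(x)
        return None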

> Let's say we choose max heaps. Now we have a root of 10^9 in the
> second-to-last heap, i.e. around n^2 elements. Its children are 4*10^8
> and 5*10^8. If I'm searching for, say, 4.5*10^8 my job is easy, but if
> I'm searching for 1000 I have to search both subtrees, so it goes
> linear and becomes around n^2 in the worst case. Did I overlook
> anything?
> 
> Instead of heaps, wouldn't a single sorted array, or breaking it into a
> series of sorted arrays (skip-list kind of stuff), be fine if you just
> want an offline data structure?

A skiplist is great. Problem is, it's random access to memory. That's bad 
enough at the best of times, and it's utter garbage if you're storing 
data on spinning disks. Even if you store it as a sorted list rather than 
a linked list, which means you never insert anything ever, it's 
O(log(n/B)) seeks and reads to find an element. If you store it as a 
linked list, 
it's O(log n) reads.

A square heap with an insertion algorithm as I describe gives you O(1) 
seeks and O(sqrt(n)/B) reads.

(Here, B stands for block size, which in this case is the number of data 
elements you will naturally get for free or almost for free after reading 
one byte. When I was doing data structure research, the rule of thumb was 
that a block is roughly 1MB for spinning disks. Divide that by element 
size to get B.

We track seeks separately because they're expensive. 9ms is a reasonable 
seek time. Comparatively, reading data sequentially is extremely cheap, 
especially if you can inform the disk scheduler that you are going to 
perform sequential reads. Seeks are expensive no matter what, even if you 
can inform the scheduler in advance -- and the skiplist doesn't let you 
do that.

Even if you switch to SSDs, SSDs tend to be ~4x faster on sequential 
reads than random reads.)

So if you have 64-byte data structures, for instance, and you've got a 
billion on disk, you have sixteen seeks for a skiplist on average, which 
costs you 144ms. (You can drop that a bit by keeping the top of the 
skiplist in memory.)
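
Back-of-envelope, using the rule-of-thumb numbers from above (1MB 
blocks, 9ms seeks; the variable names are just for the arithmetic):

    import math

    n = 10**9                # elements on disk
    elem = 64                # bytes each
    block = 1 << 20          # ~1MB sequential block
    seek_ms = 9

    B = block // elem                     # 16384 elements per block
    seeks = math.ceil(math.log2(n / B))   # ~16 for the sorted-array case
    print(seeks, seeks * seek_ms)         # 16 seeks, 144 ms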

The square heap, on the other hand? You've got a total of ~1500 heaps, so 
you can store their offsets and heads in memory. That means you identify 
the heap containing the element you're searching for without touching 
disk, and scanning the heap costs one seek and a scan. You're scanning 
700K elements on average, which means you need to look through forty or 
so blocks.
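
And the square heap side of the estimate, taking the ~1500 heaps and 
~700K-element average scan above as given:

    elem = 64                # bytes each
    block = 1 << 20          # ~1MB sequential block
    num_heaps = 1500
    avg_scan = 700_000

    # Heads plus file offsets fit comfortably in memory:
    print(num_heaps * (elem + 8))      # ~108 KB of in-memory index
    # One seek, then a sequential scan of the candidate heap:
    print(avg_scan * elem / block)     # ~43 blocks per lookup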

So the square heap could well be more efficient here. It costs ~3x the 
number of reads, but they're all sequential, so while it's close, the 
square heap probably wins out on average.

But this isn't yet a viable real-world algorithm for most things because 
of amortized inserts that promise to be expensive even after they're 
deamortized. And I haven't even sketched out the delete-side rebalance.

At least range queries will typically be reasonably fast.

> or is this some domain-specific data structure where you only ever
> want the max/min in some sequence?


