An interesting data structure with search time O(sqrt n)
Chris Wright via Digitalmars-d
digitalmars-d at puremagic.com
Wed Dec 2 10:27:06 PST 2015
On Wed, 02 Dec 2015 05:58:24 +0000, Navin wrote:
> Nice to see this interesting post and learn.
>
> I have a few questions.
>
> 1) This is an offline data structure, since you don't know what the
> future elements are going to be, i.e. it's not dynamic. Later elements
> from n to 2n can break or change your heaps in the worst case. Or is it
> a dynamic data structure?
You can usually change an offline data structure into an amortized online
one. (I think.) You can usually change an amortized online algorithm into
a deamortized one with a series of ugly hacks and large increases in
constants.
I think you can make insertions work and not be too terrible with an
amortized algorithm. Something like:
To insert a value X into the square heap, find the largest heap with a
head equal to X. If there is none, find the smallest heap with a head
greater than X. If there is none, choose the final heap. If that heap is
not full, insert X into it. If it is full, trigger a rebalance.
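In code, that heap-selection rule might look like the following (a
minimal sketch, not the author's implementation; I assume the heap
maxima are kept in an in-memory `heads` list, ordered by increasing
heap size):

```python
def choose_heap(heads, x):
    """Pick the index of the heap to insert x into: the largest
    (rightmost) heap whose head equals x; failing that, the smallest
    (leftmost) heap whose head is greater than x; failing that, the
    final heap. `heads` holds each heap's maximum, in heap-size order."""
    equal = [i for i, h in enumerate(heads) if h == x]
    if equal:
        return equal[-1]      # largest heap with head == x
    greater = [i for i, h in enumerate(heads) if h > x]
    if greater:
        return greater[0]     # smallest heap with head > x
    return len(heads) - 1     # no head >= x: use the final heap
```

If the chosen heap is full, this is where the rebalance described next
would be triggered.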
'Rebalance' is an operation we use to ensure that each intermediate heap
is exactly half full (rounding down) and the final heap is no more than
half full. We start at a given heap. We calculate its surplus -- the
number of elements it has in excess of half its capacity. We scan to the
right (increasing heap sizes) until we find a heap with at least enough
spare capacity to hold the surpluses of all the prior heaps we've
scanned. If we reach the end of the square, we create a new heap. This is
the target heap.
Now we repeatedly bubble up values from lower heaps to higher ones until
no heap between the one we tried to insert into and the target heap has
any surplus.
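One possible reading of that rebalance, as a sketch (the square
capacities, the helper names, and the exact half-full bookkeeping are my
assumptions; Python's heapq with negated keys stands in for a max heap):

```python
import heapq

def rebalance(heaps, caps, start):
    """Sketch of the rebalance step. Each heap is a heapq list with
    negated keys (so it acts as a max heap); caps[i] is heap i's
    capacity. Surplus = elements beyond half capacity, rounding down."""
    def surplus(i):
        return max(0, len(heaps[i]) - caps[i] // 2)

    # Scan right, accumulating surpluses, until some heap has enough
    # spare capacity to absorb them all; past the end, make a new heap.
    carried, target = surplus(start), start + 1
    while target < len(heaps):
        if caps[target] - len(heaps[target]) >= carried:
            break
        carried += surplus(target)
        target += 1
    if target == len(heaps):
        caps.append((len(heaps) + 1) ** 2)  # assuming square capacities
        heaps.append([])

    # Bubble maxima rightward until no heap before the target has
    # surplus. A max heap's head is its largest element, and it is no
    # larger than anything in the next heap, so moving it right
    # preserves the ordering between heaps.
    for i in range(start, target):
        while surplus(i) > 0:
            largest = -heapq.heappop(heaps[i])
            heapq.heappush(heaps[i + 1], -largest)
```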
This costs us in the worst case something like O(n * log(n) * sqrt(n)). I
haven't done the full analysis. But after doing a large rebalance, you
shouldn't need to do a large rebalance for quite a while, so the average
case, even on adversarial data, should be not so horrible.
Deletions are simple: find the element and remove it from its heap. This
can, however, reduce a heap to less than half filled. (This means I can
insert and delete elements in an adversarial pattern to force all
elements into one heap.) A similar rebalance algorithm is needed, but
it's harder on this side because we're using max heaps everywhere and
need to fill lower heaps.
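The simple half of deletion is easy to write down (a sketch under the
same negated-heapq representation as above; the under-half rebalance the
post leaves open is left out here too):

```python
import heapq

def delete(heap, x):
    """Remove one occurrence of x from a single max heap (a heapq list
    with negated keys) and restore the heap invariant. Finding x is a
    linear scan, so within one heap this is O(sqrt n)."""
    heap.remove(-x)       # linear scan of at most sqrt(n) elements
    heapq.heapify(heap)   # restore heap order after the removal
```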
> 2) Searching in min or max heaps is bad, isn't it?
Linear time in the worst case. If you're looking for something in a max
heap that happens to be the smallest element, it can be in any leaf node,
which gives you half the tree to look through. And the average isn't much
better.
However, if we can guarantee that an element is in one heap inside this
square heap, it costs us O(sqrt n) to search that heap since that heap
has no more than sqrt(n) elements.
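For example, if the heaps partition the keys in sorted order (an
assumption the post implies but doesn't state) and the heads live in
memory, the candidate heap can be located without touching the heaps and
then scanned linearly:

```python
import bisect

def search(heaps, heads, x):
    """Sketch: `heads` holds each heap's maximum, in increasing order.
    If x is present at all, it must be in the first heap whose head is
    >= x, so one linear scan of <= sqrt(n) elements settles it."""
    i = bisect.bisect_left(heads, x)  # first heap with head >= x
    if i == len(heads):
        return False
    return x in heaps[i]              # O(sqrt n) linear scan of one heap
```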
> Let's say we choose
> max heaps. Now we have the root as 10^9 in the second-last heap, i.e.
> around n^2 elements. Its children are 4*10^8 and 5*10^8. If I'm
> searching for, say, 4.5*10^8, my job is easy, but if I'm searching for
> 1000, I have to search in both subtrees, and it goes linear and
> becomes around n^2 in the worst case. Did I overlook anything?
>
> Instead of heaps, wouldn't a single sorted array, or breaking them into
> a series of sorted arrays, i.e. skip-list kind of stuff, be fine if we
> just want an offline data structure?
A skiplist is great. Problem is, it's random access to memory. That's bad
enough at the best of times, and it's utter garbage if you're storing
data on spinning disks. Even if you store it as a sorted array rather
than a linked list, which means you never insert anything ever, it's
O(log n/B) seeks and reads to find an element. If you store it as a
linked list, it's O(log n) reads.
A square heap with an insertion algorithm as I describe gives you O(1)
seeks and O(sqrt n/B) reads.
(Here, B stands for block size, which in this case is the number of data
elements you will naturally get for free or almost for free after reading
one byte. When I was doing data structure research, the rule of thumb was
that a block is roughly 1MB for spinning disks. Divide that by element
size to get B.
We track seeks separately because they're expensive. 9ms is a reasonable
seek time. Comparatively, reading data sequentially is extremely cheap,
especially if you can inform the disk scheduler that you are going to
perform sequential reads. Seeks are expensive no matter what, even if you
can inform the scheduler in advance -- and the skiplist doesn't let you
do that.
Even if you switch to SSDs, SSDs tend to be ~4x faster on sequential
reads than random reads.)
So if you have 64-byte data structures, for instance, and you've got a
billion on disk, you have sixteen seeks for a skiplist on average, which
costs you 144ms. (You can drop that a bit by keeping the top of the
skiplist in memory.)
The square heap, on the other hand? You've got a total of ~1500 heaps, so
you can store their offsets and heads in memory. That means you identify
the heap containing the element you're searching for without touching
disk, and scanning the heap costs one seek and a scan. You're scanning
700K elements on average, which means you need to look through forty or
so blocks.
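The arithmetic above checks out as a quick back-of-envelope script
(constants are the post's; the k^3/3 sizing assumes heap capacities
1^2, 2^2, ..., k^2, which the post never states outright but which the
name and the ~1500 figure suggest):

```python
import math

# Constants from the post: 64-byte elements, 10^9 of them on disk,
# ~1 MB blocks on spinning disk, ~9 ms per seek.
n       = 10**9
elem    = 64
block   = 2**20
B       = block // elem            # elements per block, ~16K
seek_ms = 9

# Sorted-array skiplist: about log2(n/B) seeks to find an element.
skip_seeks = math.log2(n / B)      # ~16 seeks
skip_ms    = skip_seeks * seek_ms  # ~144 ms

# Square heap with capacities 1^2, 2^2, ..., k^2: sum ~ k^3/3 = n.
k           = round((3 * n) ** (1 / 3))  # ~1442 heaps ("~1500")
avg_heap    = n / k                      # ~693K elements ("700K")
scan_blocks = avg_heap * elem / block    # ~42 blocks ("forty or so")
```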
So the square heap could well be more efficient here. It costs ~3x the
number of reads, but they're all sequential, so while it's close, the
square heap probably wins out on average.
But this isn't yet a viable real-world algorithm for most things because
of amortized inserts that promise to be expensive even after they're
deamortized. And I haven't even sketched out an algorithm for deletes.
At least range queries will typically be reasonably fast.
> Or is this some domain-specific data structure where you only ever
> want the max/min in some sequence?