[OT] Efficient file structure for very large lookup tables?

H. S. Teoh hsteoh at quickfur.ath.cx
Tue Dec 17 15:06:42 PST 2013


On Tue, Dec 17, 2013 at 08:54:17PM +0100, Craig Dillabaugh wrote:
> On Tuesday, 17 December 2013 at 19:24:30 UTC, tn wrote:
> >On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
> >>However, they do exhibit good spatial locality in higher-dimensional
> >>space (i.e., if entry X is accessed first, then the next entry Y is
> >>quite likely to be close to X in that space).  Does anybody know of
> >>a good data structure that can take advantage of this fact to
> >>minimize disk accesses?
> >
> >Perhaps some kind of k-d tree? Maybe even some sort of cache
> >oblivious k-d tree to make it more suitable for storage on the
> >disk?
> 
> Its not cache-oblivious buy he could try the KDB tree:
> 
> http://dl.acm.org/citation.cfm?id=582321
> 
> or, again for batched queries the is the
> 
> Bkd-tree (search on Bkd-tree and Lars Arge) - this paper is a type
> of buffer tree so the operations need to be able to be run as a
> batch.

I skimmed over the KDB tree paper, and am reading the Bkd-tree paper
now; it appears that the latter is an improvement over the former, and
has better update performance characteristics. However, it also appears
much more complicated to implement.

Another point I forgot to mention, is that currently I only ever perform
insertion and queries, and the two can actually be combined (basically
the table is kept for uniqueness testing). I don't need to delete
entries from the table. Eventually I might look into that to reduce disk
space requirements, but as far as the actual algorithm is concerned,
it's not necessary. So there's potentially a good amount of room here
for optimization by foregoing the ability to delete entries.

I'm currently considering using simpler, "dumber" file structures for
storing nodes, with an in-memory Bloom filter to eliminate a (hopefully)
large number of I/O roundtrips, since I'm expecting that a good fraction
of queries will return 'not found'.


T

-- 
"A one-question geek test. If you get the joke, you're a geek: Seen on a
California license plate on a VW Beetle: 'FEATURE'..." -- Joshua D.
Wachs - Natural Intelligence, Inc.


More information about the Digitalmars-d mailing list