[OT] Efficient file structure for very large lookup tables?

Joseph Cassman jc7919 at outlook.com
Tue Dec 17 11:43:17 PST 2013


On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
> Another OT thread to pick your brains. :)
>
> What's a good, efficient file structure for storing extremely large
> lookup tables? (Extremely large as in > 10 million entries, with keys
> and values roughly 100 bytes each.) The structure must support
> efficient adding and lookup of entries, as these two operations will
> be very frequent.
>
> I did some online research, and it seems that hashtables perform
> poorly on disk, because the usual hash functions cause random
> scattering of related data (which are likely to be accessed with
> higher temporal locality), and that incurs lots of disk seeks.
>
> I thought about B-trees, but they have high overhead (and are a pain
> to implement), and they only exhibit good locality if table entries
> are accessed sequentially; the problem is that I'm working with
> high-dimensional data, and the order of accesses is unlikely to be
> sequential. However, the accesses do exhibit good spatial locality in
> the higher-dimensional space (i.e., if entry X is accessed first, the
> next entry Y is quite likely to be close to X in that space). Does
> anybody know of a good data structure that can take advantage of this
> fact to minimize disk accesses?
>
>
> T

A burst trie might work, although it was originally designed for 
string keys. I haven't had a chance to try one out in code yet, but 
the paper by Heinz, Zobel, and Williams is interesting; a rough 
sketch of the idea is below the reference.

Burst Tries: A Fast, Efficient Data Structure for String Keys (2002)
http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499
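To make the idea concrete, here is a rough, untested sketch in D of 
how I understand the structure. The burst threshold, the use of 
built-in associative arrays for the leaf containers, and the 
byte-at-a-time fanout are my own simplifications for illustration, 
not what the paper prescribes (it evaluates small container 
structures such as lists and BSTs and tunes the threshold):

// Rough, untested burst-trie sketch. Assumptions: keys are byte
// strings long enough that a suffix never becomes empty before
// reaching a leaf (fine for the ~100-byte keys in the original post),
// everything lives in memory, and burstLimit is an arbitrary
// placeholder value.
import std.stdio : writeln;
import std.string : representation;

enum burstLimit = 32;

final class Node
{
    Node[256] children;                   // interior node: next-byte fanout
    string[immutable(ubyte)[]] container; // leaf node: key suffix -> value
    bool isLeaf = true;

    void insert(immutable(ubyte)[] key, string value)
    {
        if (isLeaf)
        {
            container[key] = value;
            if (container.length > burstLimit)
                burst();                   // leaf grew too big: split it
            return;
        }
        auto b = key[0];                   // route on the next key byte
        if (children[b] is null)
            children[b] = new Node;
        children[b].insert(key[1 .. $], value);
    }

    string* lookup(immutable(ubyte)[] key)
    {
        if (isLeaf)
            return key in container;
        auto child = children[key[0]];
        return child is null ? null : child.lookup(key[1 .. $]);
    }

private:
    void burst()
    {
        isLeaf = false;
        foreach (k, v; container)          // redistribute suffixes to children
        {
            auto b = k[0];
            if (children[b] is null)
                children[b] = new Node;
            children[b].container[k[1 .. $]] = v;
        }
        container = null;
    }
}

void main()
{
    auto root = new Node;
    root.insert("example-key".representation, "value");
    auto p = root.lookup("example-key".representation);
    writeln(p !is null ? *p : "not found");
}

For the on-disk case the interesting part would presumably be mapping 
each leaf container to a disk page, so that keys sharing a prefix end 
up in the same page and a lookup touches only the pages along one path.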

Joseph

