[OT] Efficient file structure for very large lookup tables?

Craig Dillabaugh craig.dillabaugh at gmail.com
Tue Dec 17 11:47:17 PST 2013


On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
> Another OT thread to pick your brains. :)
>
> What's a good, efficient file structure for storing extremely large
> lookup tables? (Extremely large as in > 10 million entries, with
> keys and values roughly about 100 bytes each.) The structure must
> support efficient adding and lookup of entries, as these two
> operations will be very frequent.
>
> I did some online research, and it seems that hashtables perform
> poorly on disk, because the usual hash functions cause random
> scattering of related data (which are likely to be accessed with
> higher temporal locality), which incurs lots of disk seeks.
>
> I thought about B-trees, but they have high overhead (and are a
> pain to implement), and also only exhibit good locality if table
> entries are accessed sequentially; the problem is I'm working with
> high-dimensional data and the order of accesses is unlikely to be
> sequential. However, they do exhibit good spatial locality in
> higher-dimensional space (i.e., if entry X is accessed first, then
> the next entry Y is quite likely to be close to X in that space).
> Does anybody know of a good data structure that can take advantage
> of this fact to minimize disk accesses?
>
>
> T

As a first attempt, could you use a key-value database (like Redis,
if you have enough memory to fit everything in)? Or is that out of
the question?

Another question: can your queries be batched? If so, and your data
is bigger than your available memory, then try Googling "Lars Arge
Buffer Tree", which might work well. However, if you thought
implementing a B-tree was going to be painful, that might not appeal
to you. If you don't want to implement it yourself, you could look
at TPIE:

http://www.madalgo.au.dk/tpie/

Although it is in C++.
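
A proper buffer tree is a fair bit of machinery, but the basic idea
of batching is simple enough to sketch. Roughly: if the table is kept
sorted by key on disk, you can sort each batch of queries and answer
the whole batch in one sequential scan instead of one seek per query
(a buffer tree essentially applies the same trick recursively inside
the tree, buffering operations at each node). A rough D sketch, with
a made-up fixed-size record layout matching the ~100-byte keys and
values from your post:

import std.algorithm : sort;
import std.stdio;

// Invented fixed-size layout for illustration; the real record
// format would obviously differ.
struct Record
{
    ubyte[100] key;
    ubyte[100] value;
}

// Answer a batch of queries against a table file kept sorted by key:
// sort the batch, then merge it with one sequential pass over the file.
Record[] batchedLookup(string tableFile, ubyte[100][] queries)
{
    sort!((a, b) => a[] < b[])(queries);   // order the batch by key

    Record[] results;
    auto table = File(tableFile, "rb");
    Record rec;
    size_t q = 0;

    while (q < queries.length && table.rawRead((&rec)[0 .. 1]).length == 1)
    {
        // skip queries whose keys sort before the current record
        while (q < queries.length && queries[q][] < rec.key[])
            ++q;
        if (q < queries.length && queries[q] == rec.key)
        {
            results ~= rec;   // found a match for this query
            ++q;
        }
    }
    return results;
}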

If I had to design something quick on the spot, my first guess would
be to use a grid on the first two dimensions, bin the 'points' or
keys within each grid square, and build a simpler structure on each
bin. This won't work so well, though, for really high-dimensional
data or if the 'points' are randomly distributed.
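
For illustration, here is a very rough D sketch of that binning idea.
The cell width, the file naming, and the assumption that each key
carries numeric coordinates are all made up for the example; within
each bucket file the data should be small enough that a sorted array
or a tiny B-tree would do.

import std.format : format;
import std.math : floor;
import std.stdio;

// Tuning parameter, invented for the example; pick it so that each
// grid cell holds a comfortable amount of data.
enum double cellWidth = 1.0;

// Map the first two coordinates of a key to a grid-cell bucket name.
string cellOf(const double[] coords)
{
    auto ix = cast(long) floor(coords[0] / cellWidth);
    auto iy = cast(long) floor(coords[1] / cellWidth);
    return format("cell_%s_%s.bin", ix, iy);
}

// Append a record to the bucket file for its grid cell; each bucket
// then gets its own simpler lookup structure built over it.
void insert(const double[] coords, const(ubyte)[] key, const(ubyte)[] value)
{
    auto f = File(cellOf(coords), "ab");   // reopening per call is
    f.rawWrite(key);                       // wasteful, but keeps the
    f.rawWrite(value);                     // sketch short
    f.close();
}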

Also, what exactly do you mean by "in that space" when you say:

"if entry X is accessed first, then the next entry Y is quite 
likely to be close to X in that space".

Do you mean that the value of Y in the next dimension is (or is
expected to be) numerically close to that of X?

Cheers,

Craig
