[OT] Efficient file structure for very large lookup tables?

Brad Roberts braddr at puremagic.com
Tue Dec 17 17:09:43 PST 2013


If you can afford to do a one-time sort of both sets of data, then you can do a very simple 
single-pass intersect through both sets.  Chances are that won't work on its own, given how 
frequently you're adding entries.  So, a potential alternative: use that approach combined with a 
smaller delta of changes since the last sort, hopefully small enough to fit in memory in a structure 
that's convenient to probe.  Periodically, as the delta grows too big, merge it into the primary 
data file.
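
Roughly what I have in mind, as an untested sketch in D: it assumes simple line-oriented 
"key<TAB>value" records with string keys, and the names (DeltaTable, flushThreshold, primaryPath) 
are just placeholders, not anything from your setup.

// Untested sketch: sorted primary file on disk plus an in-memory delta
// of recent additions, merged in periodically with one sequential pass.
import std.algorithm : findSplit, sort;
import std.file : exists, rename;
import std.stdio : File;

struct DeltaTable
{
    string primaryPath;               // big sorted file on disk
    string[string] delta;             // recent additions, kept in memory
    size_t flushThreshold = 1_000_000;

    void add(string key, string value)
    {
        delta[key] = value;
        if (delta.length >= flushThreshold)
            mergeDelta();
    }

    // Probe the small in-memory delta first, then fall back to the primary
    // file.  (Sketch only: a real version would binary search fixed-width
    // records or keep a sparse block index instead of scanning.)
    bool lookup(string key, out string value)
    {
        if (auto p = key in delta)
        {
            value = *p;
            return true;
        }
        if (!exists(primaryPath))
            return false;
        foreach (line; File(primaryPath, "r").byLineCopy)
        {
            auto parts = findSplit(line, "\t");
            if (parts[0] == key)
            {
                value = parts[2];
                return true;
            }
        }
        return false;
    }

    // The periodic merge: one sequential pass over the old primary file,
    // interleaving the sorted delta keys, then swap the new file in.
    void mergeDelta()
    {
        auto keys = delta.keys;
        sort(keys);

        auto tmpPath = primaryPath ~ ".tmp";
        auto output = File(tmpPath, "w");
        size_t i = 0;

        if (exists(primaryPath))
        {
            foreach (line; File(primaryPath, "r").byLineCopy)
            {
                auto parts = findSplit(line, "\t");
                // Emit any delta entries that sort before this record.
                while (i < keys.length && keys[i] < parts[0])
                {
                    output.writefln("%s\t%s", keys[i], delta[keys[i]]);
                    ++i;
                }
                if (i < keys.length && keys[i] == parts[0])
                {
                    // Newer delta entry replaces the old record.
                    output.writefln("%s\t%s", keys[i], delta[keys[i]]);
                    ++i;
                }
                else
                    output.writeln(line);
            }
        }
        // Whatever is left in the delta sorts after the old file's end.
        for (; i < keys.length; ++i)
            output.writefln("%s\t%s", keys[i], delta[keys[i]]);

        output.close();
        rename(tmpPath, primaryPath);
        delta = null;
    }
}

The nice property is that the merge is purely sequential I/O, and lookups only pay for the delta 
probe plus whatever on-disk index you put behind it.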

Try looking at the access patterns to see if there's something you can do to adjust them to make the 
storage problem simpler.

On 12/17/13 11:08 AM, H. S. Teoh wrote:
> Another OT thread to pick your brains. :)
>
> What's a good, efficient file structure for storing extremely large
> lookup tables? (Extremely large as in > 10 million entries, with keys
> and values roughly 100 bytes each.) The structure must support
> efficient adding and lookup of entries, as these two operations will be
> very frequent.
>
> I did some online research, and it seems that hashtables perform poorly
> on disk, because the usual hash functions cause random scattering of
> related data (which are likely to be accessed with higher temporal
> locality), which incurs lots of disk seeks.
>
> I thought about B-trees, but they have high overhead (and are a pain to
> implement), and also only exhibit good locality if table entries are
> accessed sequentially; the problem is I'm working with high-dimensional
> data and the order of accesses is unlikely to be sequential. However,
> the accesses do exhibit good spatial locality in that higher-dimensional space (i.e.,
> if entry X is accessed first, then the next entry Y is quite likely to
> be close to X in that space).  Does anybody know of a good data
> structure that can take advantage of this fact to minimize disk
> accesses?
>
>
> T
>
