[OT] Efficient file structure for very large lookup tables?

luka8088 luka8088 at owave.net
Tue Dec 17 14:46:28 PST 2013


On 17.12.2013. 20:08, H. S. Teoh wrote:
> Another OT thread to pick your brains. :)
> 
> What's a good, efficient file structure for storing extremely large
> lookup tables? (Extremely large as in > 10 million entries, with keys
> and values roughly 100 bytes each.) The structure must support
> efficient insertion and lookup of entries, as both operations will be
> very frequent.
> 
> I did some online research, and it seems that hashtables perform poorly
> on disk, because the usual hash functions cause random scattering of
> related data (which is likely to be accessed with high temporal
> locality), and that scattering incurs lots of disk seeks.
> 
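The scattering is easy to see concretely. Below is a minimal sketch in
C (FNV-1a stands in for "the usual hash function", and the bucket count
is made up): three neighbouring keys land in unrelated buckets, so
accesses that are adjacent in key space turn into random seeks on disk.

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* 64-bit FNV-1a, used here only as a typical general-purpose hash. */
static uint64_t fnv1a(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t h = 14695981039346656037ULL;   /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;              /* FNV prime */
    }
    return h;
}

int main(void)
{
    enum { NBUCKETS = 1 << 20 };  /* hypothetical on-disk bucket count */
    /* Three neighbouring points in a 3-D key space. */
    int keys[3][3] = { {10, 10, 10}, {10, 10, 11}, {10, 11, 10} };
    for (int i = 0; i < 3; i++) {
        uint64_t h = fnv1a(keys[i], sizeof keys[i]);
        printf("key (%d,%d,%d) -> bucket %llu\n", keys[i][0],
               keys[i][1], keys[i][2],
               (unsigned long long)(h % NBUCKETS));
    }
    return 0;
}
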
> I thought about B-trees, but they have high overhead (and are a pain to
> implement), and they only exhibit good locality if table entries are
> accessed sequentially; the problem is that I'm working with
> high-dimensional data, and the order of accesses is unlikely to be
> sequential. However, the accesses do exhibit good spatial locality in
> the higher-dimensional space (i.e., if entry X is accessed first, the
> next entry Y is quite likely to be close to X in that space). Does
> anybody know of a good data structure that can take advantage of this
> fact to minimize disk accesses?
> 
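Regarding the closing question: one standard answer (a suggestion on my
part, not something I've benchmarked) is to order the entries along a
space-filling curve such as the Z-order (Morton) curve. Interleaving
the bits of the coordinates gives a single scalar key under which
points that are close in the original space tend to get nearby keys, so
a plain sorted file, a B-tree, or an indexed sqlite table recovers much
of the spatial locality. A rough sketch for three 21-bit coordinates:

#include <stdio.h>
#include <stdint.h>

/* Interleave the low 21 bits of x, y, and z into a 63-bit Morton key.
 * Sorting (or B-tree-indexing) entries by this key keeps spatially
 * close entries close together on disk, most of the time. */
static uint64_t morton3(uint32_t x, uint32_t y, uint32_t z)
{
    uint64_t key = 0;
    for (int i = 0; i < 21; i++) {
        key |= (uint64_t)((x >> i) & 1) << (3 * i);
        key |= (uint64_t)((y >> i) & 1) << (3 * i + 1);
        key |= (uint64_t)((z >> i) & 1) << (3 * i + 2);
    }
    return key;
}

int main(void)
{
    /* Neighbouring points map to nearby keys... */
    printf("%llu\n", (unsigned long long)morton3(10, 10, 10));
    printf("%llu\n", (unsigned long long)morton3(10, 10, 11));
    /* ...while a distant point maps to a distant key. */
    printf("%llu\n", (unsigned long long)morton3(1000, 1000, 1000));
    return 0;
}

The mapping is not perfect (the curve makes occasional long jumps), but
on average neighbours in space stay neighbours on disk.
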
> 
> T
> 

The sqlite file format seems to be fairly well documented:
http://www.sqlite.org/fileformat.html
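
For the plain key/value case it may be simpler to use sqlite through
its C API than to reimplement the file format. Here is a minimal sketch
(the file name, table name, and the INSERT OR REPLACE policy are
choices of mine, and error handling is mostly omitted); sqlite keeps
tables and indexes in B-trees internally, so insertion and lookup stay
cheap without having to implement the tree yourself:

#include <stdio.h>
#include <sqlite3.h>

int main(void)
{
    sqlite3 *db;
    sqlite3_stmt *st;
    const char key[] = "some-100-byte-key";
    const char val[] = "some-100-byte-value";

    if (sqlite3_open("lookup.db", &db) != SQLITE_OK)
        return 1;
    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS kv (k BLOB PRIMARY KEY, v BLOB)",
        0, 0, 0);

    /* Insert (or replace) one entry. */
    sqlite3_prepare_v2(db, "INSERT OR REPLACE INTO kv VALUES (?, ?)",
                       -1, &st, 0);
    sqlite3_bind_blob(st, 1, key, sizeof key, SQLITE_STATIC);
    sqlite3_bind_blob(st, 2, val, sizeof val, SQLITE_STATIC);
    sqlite3_step(st);
    sqlite3_finalize(st);

    /* Look the same key up again. */
    sqlite3_prepare_v2(db, "SELECT v FROM kv WHERE k = ?", -1, &st, 0);
    sqlite3_bind_blob(st, 1, key, sizeof key, SQLITE_STATIC);
    if (sqlite3_step(st) == SQLITE_ROW)
        printf("found: %s\n", (const char *)sqlite3_column_blob(st, 0));
    sqlite3_finalize(st);
    sqlite3_close(db);
    return 0;
}

(Compile with -lsqlite3.)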


