[OT] Efficient file structure for very large lookup tables?

John Colvin john.loughran.colvin at gmail.com
Tue Dec 17 13:32:01 PST 2013


On Tuesday, 17 December 2013 at 20:50:23 UTC, H. S. Teoh wrote:
> On Tue, Dec 17, 2013 at 08:47:17PM +0100, Craig Dillabaugh 
> wrote:
>> On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
>> >Another OT thread to pick your brains. :)
>> >
>> >What's a good, efficient file structure for storing extremely large
>> >lookup tables? (Extremely large as in > 10 million entries, with
>> >keys and values roughly about 100 bytes each.) The structure must
>> >support efficient adding and lookup of entries, as these two
>> >operations will be very frequent.
>> >
>> >I did some online research, and it seems that hashtables perform
>> >poorly on disk, because the usual hash functions cause random
>> >scattering of related data (which are likely to be accessed with
>> >higher temporal locality), which incurs lots of disk seeks.
>> >
>> >I thought about B-trees, but they have high overhead (and are a pain
>> >to implement), and also only exhibit good locality if table entries
>> >are accessed sequentially; the problem is I'm working with
>> >high-dimensional data and the order of accesses is unlikely to be
>> >sequential.  However, they do exhibit good spatial locality in
>> >higher-dimensional space (i.e., if entry X is accessed first, then
>> >the next entry Y is quite likely to be close to X in that space).
>> >Does anybody know of a good data structure that can take advantage
>> >of this fact to minimize disk accesses?
>> >
>> >
>> >T
>> 
>> As a first attempt, could you use a key-value database (like REDIS,
>> if you have enough memory to fit everything in)?  Or is that out of
>> the question?
>
> Well, currently I'm just using plain ole in-memory hash tables to
> store the data. It works quite well for small problems, but I'm
> finding that memory runs out quickly with problems of moderate size,
> so I need to move the storage to disk. I'm just looking to minimize
> I/O latency, which hash tables are notorious for. (A previous
> incarnation of the program uses Berkeley DB hash tables, but it
> becomes I/O bound and slow when the problem size is large.)
>

I can't really add anything on the fancy data structures/algorithms;
it's all a bit out of my league.

That said, have you considered asynchronously maintaining the locality
in memory, i.e. prefetching? If you have predictable access and good
locality, it should perform well. In particular, you can build a
prefetcher that is tuned to your particular access pattern.
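
Something like this is roughly what I mean (completely untested, and
loadEntry/predictNext are just stand-ins for however you actually read
the table and guess the next access) -- using std.parallelism to overlap
the predicted next read with the work on the current entry:

import std.parallelism : task;
import std.stdio : writefln;

// Stand-in for the real on-disk lookup: in your program this would
// seek into the table file; here it just fabricates a value.
string loadEntry(string key)
{
    return "value-of-" ~ key;
}

// Stand-in predictor: given the key just requested, guess which key
// will be wanted next (e.g. the nearest neighbour in your space).
string predictNext(string key)
{
    return key ~ "'";
}

void main()
{
    string[string] cache;   // in-memory staging area for prefetched entries
    string key = "start";

    foreach (step; 0 .. 3)
    {
        // Kick off the predicted next read in the background...
        auto next = predictNext(key);
        auto pending = task!loadEntry(next);
        pending.executeInNewThread();

        // ...and overlap it with the (possibly slow) work on the
        // current entry.
        auto value = key in cache ? cache[key] : loadEntry(key);
        writefln("%s -> %s", key, value);

        // By the time the next entry is needed, the read has
        // hopefully already finished.
        cache[next] = pending.yieldForce();
        key = next;
    }
}

A real version would presumably prefetch a whole neighbourhood of
nearby entries per step rather than a single key, and evict old entries
from the cache so it doesn't just recreate the memory problem.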

