[OT] Efficient file structure for very large lookup tables?

H. S. Teoh hsteoh at quickfur.ath.cx
Tue Dec 17 15:34:14 PST 2013


On Tue, Dec 17, 2013 at 09:43:37PM +0100, Dylan Knutson wrote:
> It's not a data structure per se, but Google's LevelDB is a very
> good, quite fast arbitrary key-value storage system (I use it in my
> project relating a 1-word key to a 200-word value) supporting batch
> transactional operations.
[...]

I'm not looking for a transactional DB, though; I'm looking for a fast
key-value storage system with very high insert/query throughput and,
ideally, modest space requirements.  Deletion is not needed, though for
very large problems it might be useful for reducing disk space usage.
It must be able to handle huge numbers of insertions (say 10 million
entries for small/moderate problem instances, perhaps billions or
trillions for a large problem instance), and it must be able to do so
very fast -- the faster the better.
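
Roughly, the interface I'm after looks something like this (just a
sketch to pin down the operations; the in-memory associative array is a
stand-in for the real disk-backed store, and the names are made up):

import std.stdio;

// Sketch only: an in-memory AA standing in for the real disk-backed
// store, just to show the operations I need (note: no deletion).
struct Store(Key, Value)
{
    private Value[Key] entries;

    // Insertions must sustain very high throughput.
    void insert(Key k, Value v) { entries[k] = v; }

    // Lookups must be equally fast; returns true if the key was found.
    bool lookup(Key k, out Value v)
    {
        auto p = k in entries;
        if (p is null) return false;
        v = *p;
        return true;
    }
}

void main()
{
    Store!(ulong, uint) s;
    s.insert(42, 7);
    uint v;
    writeln(s.lookup(42, v), " ", v);  // prints: true 7
}

The real thing would, of course, have to spill to disk and batch its
writes; the point is just that insert and lookup are the only
operations that matter.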

Preferably, it would take advantage of the n-dimensional locality that I
described in my other post. Basically, the idea is to reduce I/O
roundtrips by caching disk pages in memory, because there are far more
entries than will fit in memory all at once. But since the number of
entries is very large, I have to maximize the number of cache hits to
prevent thrashing. It does no good if I have to load 100 disk pages
to answer 100 queries; if 100 pages fit in RAM, I'd like most of the
entries in those 100 pages to be used in answering queries before they
get swapped out for new pages to be loaded in. In the ideal case
(probably not attainable), those 100 pages would contain exactly the
next, say, million entries I'm about to query, so that once I'm done
with them, I can swap those pages out and never have to load them back
in again, freeing the space for the next 100 pages to be loaded and
queried. So the layout of the disk pages should correspond as closely
as possible to the order in which the algorithm will look up entries.
(Unfortunately that order isn't 100% predictable, as it depends on the
problem being solved, but at least there are general trends we can take
advantage of.) A generic storage scheme like SQLite will probably not
perform as fast as I'd like.
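
One standard trick along those lines would be a space-filling-curve
(Z-order/Morton) encoding of the n coordinates, so that entries that
are close in the search space get nearby keys and therefore tend to
land on the same or adjacent disk pages when records are stored in key
order. A minimal sketch (illustrative only -- not necessarily the
encoding I'd end up using, and it assumes the coordinates fit in a uint
and 64 key bits suffice):

import std.stdio;

// Interleave the bits of an n-dimensional coordinate into a single
// Z-order (Morton) key.  Nearby coordinates produce nearby keys.
ulong mortonKey(const(uint)[] coords)
{
    ulong key = 0;
    immutable dims = coords.length;
    foreach (b; 0 .. 64 / dims)        // bit position within a coordinate
        foreach (i, c; coords)         // which coordinate
            key |= ((cast(ulong)c >> b) & 1) << (b * dims + i);
    return key;
}

void main()
{
    // Two neighbouring 3-D coordinates map to nearby keys:
    writefln("%016x", mortonKey([5, 9, 1]));
    writefln("%016x", mortonKey([5, 9, 2]));
}

With records laid out in mortonKey order, a fixed-size page cache gets
many hits per page load, provided the algorithm's query order roughly
follows the curve.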


T

-- 
It is impossible to make anything foolproof because fools are so
ingenious. -- Sammy

