[OT] Efficient file structure for very large lookup tables?

H. S. Teoh hsteoh at quickfur.ath.cx
Tue Dec 17 12:48:39 PST 2013


On Tue, Dec 17, 2013 at 08:47:17PM +0100, Craig Dillabaugh wrote:
> On Tuesday, 17 December 2013 at 19:09:49 UTC, H. S. Teoh wrote:
> >Another OT thread to pick your brains. :)
> >
> >What's a good, efficient file structure for storing extremely large
> >lookup tables? (Extremely large as in > 10 million entries, with keys
> >and values roughly about 100 bytes each.) The structure must support
> >efficient adding and lookup of entries, as these two operations will
> >be very frequent.
> >
> >I did some online research, and it seems that hashtables perform
> >poorly on disk, because the usual hash functions cause random
> >scattering of related data (which are likely to be accessed with higher
> >temporal locality), which incurs lots of disk seeks.
> >
> >I thought about B-trees, but they have high overhead (and are a pain
> >to implement), and also only exhibit good locality if table entries
> >are accessed sequentially; the problem is I'm working with
> >high-dimensional data and the order of accesses is unlikely to be
> >sequential.  However, the accesses do exhibit good spatial locality in
> >higher-dimensional space (i.e., if entry X is accessed first, then
> >the next entry Y is quite likely to be close to X in that space).
> >Does anybody know of a good data structure that can take advantage of
> >this fact to minimize disk accesses?
> >
> >
> >T
> 
> As a first attempt, could you use a key-value database (like Redis, if
> you have enough memory to fit everything in)?  Or is that out of the
> question?

Well, currently I'm just using plain ole in-memory hash tables to store
the data. It works quite well for small problems, but I'm finding that
memory runs out quickly with problems of moderate size, so I need to
move the storage to disk. I'm just looking to minimize I/O latency,
which hash tables are notorious for. (A previous incarnation of the
program used Berkeley DB hash tables, but it became I/O bound and slow
when the problem size was large.)
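
For concreteness, here's a minimal sketch (in D) of roughly what the
in-memory version boils down to; the type names and the fixed 100-byte
sizes are just illustrative, based on the rough figures above:

// Minimal sketch only: built-in AA keyed on ~100-byte keys.
import std.stdio;

alias Key   = ubyte[100];   // ~100-byte key (illustrative size)
alias Value = ubyte[100];   // ~100-byte value (illustrative size)

void main()
{
    // Fast while it fits in RAM, but at >10 million entries of ~200
    // bytes each, memory runs out, hence the need for disk storage.
    Value[Key] table;

    Key k;
    Value v;
    k[0] = 1;
    v[0] = 42;
    table[k] = v;

    if (auto p = k in table)
        writeln("found, first byte = ", (*p)[0]);
}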


> Another question is can your queries be batched?  If that is the
> case and your data is bigger than your available memory, then try
> Googling "Lars Arge Buffer Tree" which might work well.

Hmm. I'm not sure if it's possible to batch the queries unless I
multithread the algorithm; the queries are generated on-the-fly. But
they *are* somewhat predictable (spatial locality: see below), so
perhaps that could help?


[...]
> If I had to design something quick on the spot, my first guess would
> be to use a grid on the first two dimensions and then bin the
> 'points' or keys within each grid square and build a simpler
> structure on those.  This won't work so well though for really high
> dimension data or if the 'points' are randomly distributed.
>
> Also, what exactly do you mean by "in that space" when you say:
> 
> "if entry X is accessed first, then the next entry Y is quite likely
> to be close to X in that space".
> 
> Do you mean that the value of Y in the next dimension is numerically
> close (or expected to be) to X?
[...]

How many dimensions is considered "really high"?  In my case the number
of dimensions can be up to about 50. Is that high? Also, the points are
not randomly distributed; they are highly connected (i.e., they tend to
be adjacent to each other).

Basically, one possible representation of the keys is as an
n-dimensional array of integers. (They are not natively in that form,
but can be compressed into that form easily.) And the way the algorithm
works is that if the key (p1,p2,p3,...) is accessed, then the next key
(q1,q2,q3,...) accessed is likely to differ from (p1,p2,p3,...) in only
a small number of coordinates, and the difference in those coordinates
is likely to be very small (usually in the +1/-1 range).
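
To make that concrete, here's a rough sketch of the key representation
and a way to measure that locality; the dimension constant and the
function names here are mine, purely for illustration:

// Sketch of the key representation and the access locality described
// above; N = 50 is the upper end mentioned, names are illustrative.
import std.math : abs;
import std.stdio;

enum N = 50;
alias Key = int[N];     // one integer coordinate per dimension

// How many coordinates differ between two consecutive keys, and by how
// much in total (L1 distance).  Under the access pattern described,
// both numbers should usually be very small.
size_t numDiffering(in Key p, in Key q)
{
    size_t count;
    foreach (i; 0 .. N)
        if (p[i] != q[i]) ++count;
    return count;
}

int l1Distance(in Key p, in Key q)
{
    int dist;
    foreach (i; 0 .. N)
        dist += abs(p[i] - q[i]);
    return dist;
}

void main()
{
    Key p, q;
    q[3] = 1;  q[7] = -1;   // q differs from p in two coords, by +/-1
    writeln(numDiffering(p, q), " coords differ, L1 distance ",
            l1Distance(p, q));
}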

Actually, now that I think of it... the problem may be far more
tractable than I realized. I was previously thinking of some kind of
n-dimensional space-partitioning storage scheme, where the state space
is partitioned into n-dimensional cube buckets, and points are placed in
their corresponding buckets. However, this had the problem that an
n-cube has 2^n vertices, and a cube-shaped bucket of side k holds k^n
points; since I can't control n, it's difficult to map bucket sizes to
page sizes for optimal I/O performance. For large n, even the smallest
bucket would be unmanageably big (e.g., in the 50-dimensional case).
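
As a quick back-of-the-envelope check of why that blows up (the side
length k=3 below is just an arbitrary example):

// A cube-shaped bucket of side k in n dimensions holds k^n points,
// which grows hopelessly fast with n.
import std.bigint;
import std.stdio;

void main()
{
    enum k = 3;     // arbitrary small side length, for illustration
    foreach (n; [2, 10, 50])
        writefln("k=%s, n=%s: %s points per bucket", k, n, BigInt(k) ^^ n);
}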

But it just dawned on me that partitioning into n-cubes is unnecessary,
because most of the time, I only care about points that differ in a
small number of coordinates from the current point. So a more efficient
bucket shape would be something closer to an n-cross (the
higher-dimensional equivalent of the octahedron), which has only 2*n
vertices, or one of its derivatives. This makes bucket sizes far easier
to control, and far more efficient to store. The only tricky part is
how to partition n-space into n-cross-shaped pieces: it would have to
be a proper tiling (otherwise there'd be ambiguity over which bucket a
given point falls into). So now I have to do some research into
n-dimensional tilings... :P
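
As a starting point, here's a heavily hedged sketch of just the
bucket-membership half of that idea: an n-cross of radius r around a
center c is simply the set of points within L1 distance r of c.  The
centers below are hand-picked and hypothetical; choosing centers so
that the crosses actually tile n-space without gaps or overlap is
exactly the part I still need to research.

// Sketch only: membership in an n-cross bucket is an L1-ball test.
import std.math : abs;
import std.stdio;

enum N = 4;                 // small dimension, for illustration only
alias Point = int[N];

int l1(in Point p, in Point c)
{
    int d;
    foreach (i; 0 .. N)
        d += abs(p[i] - c[i]);
    return d;
}

// Index of the first candidate center whose n-cross (L1 ball) of
// radius r contains p, or -1 if none does.
ptrdiff_t bucketOf(in Point p, in Point[] centers, int r)
{
    foreach (i, c; centers)
        if (l1(p, c) <= r)
            return cast(ptrdiff_t) i;
    return -1;
}

void main()
{
    Point[] centers = [[0, 0, 0, 0], [3, 0, 0, 0]];  // hypothetical
    Point p = [1, -1, 0, 0];
    writeln("bucket = ", bucketOf(p, centers, 2));   // prints: bucket = 0
}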


T

-- 
"The number you have dialed is imaginary. Please rotate your phone 90 degrees and try again."

