[OT] Best algorithm for extremely large hashtable?

Fri Nov 15 12:21:11 PST 2013

On Fri, Nov 15, 2013 at 09:03:17PM +0100, qznc wrote:
> On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:
[...]
> >I'm trying to implement an algorithm that traverses a very large
> >graph, and I need some kind of data structure to keep track of which
> >nodes have been visited, that (1) allows reasonably fast lookups
> >(preferably O(1)), and (2) doesn't require GB's of storage (i.e.,
> >some kind of compression would be nice).
> >
> >The graph nodes can be represented in various ways, but possibly the
> >most convenient representation is as n-dimensional vectors of
> >(relatively small) integers. Furthermore, graph edges are always
> >between vectors that differ only by a single coordinate; so the edges
> >of the graph may be thought of as a subset of the edges of an
> >n-dimensional grid. The hashtable, therefore, needs to represent some
> >connected subset of this grid in a space-efficient manner, that still
> >allows fast lookup times.
> >
> >The naïve approach of using an n-dimensional bit array is not
> >feasible because n can be quite large (up to 100 or so), and the size
> >of the grid itself can get up to about 10 in each direction, so we're
> >looking at a potential maximum size of 10^100, clearly impractical to
> >store explicitly.
> 
> So, -10 to 10 in discrete steps. This means 5 bits per dimension and
> 500 bits for a single coordinate. Is the graph distributed of a
> compute cluster or does it fit into single computer? With a few GB
> of RAM, this means your graph is quite sparse, yet nodes are
> connected ("differ only by a single coordinate")?

The graph is not explicitly stored, because it is far too large. The
nodes are generated on-the-fly as needed; the challenge is to know
whether a particular generated node has been generated before.
Basically, the equivalent of a hashtable of previously-visited nodes,
but I'm hoping for (1) some kind of compressed representation so that as
many nodes as possible can be kept in RAM for speed, and (2) something
more efficient for disk I/O, because I'm expecting the number of
generated nodes will not fit in RAM and will have to be put on disk (and
hashtables don't do very well with disk I/O due to the latency).

> Can you preprocess? I mean, walk all the nodes O(n) to compute a
> good (perfect?) hash function.

Walking the nodes defeats the purpose, because the whole idea is to
*not* have to generate every node, just those selected by the algorithm
(those are already very unwieldy in number, nevermind generating *all*
nodes).

> In general, I think you should either store the flag right in the
> graph node or mirror the graph structure.
[...]

That doesn't solve the problem when graph nodes are generated
on-the-fly, since I wouldn't know where to look up the node structure
without first knowing that it has been generated before.

T

-- 
Ph.D. = Permanent head Damage