[OT] Best algorithm for extremely large hashtable?

Peter Alexander peter.alexander.au at gmail.com
Fri Nov 15 12:46:00 PST 2013


On Friday, 15 November 2013 at 20:22:29 UTC, H. S. Teoh wrote:
> The graph is not explicitly stored, because it is far too large.
> The nodes are generated on-the-fly as needed; the challenge is to
> know whether a particular generated node has been generated before.
> Basically, the equivalent of a hashtable of previously-visited
> nodes, but I'm hoping for (1) some kind of compressed
> representation so that as many nodes as possible can be kept in RAM
> for speed, and (2) something more efficient for disk I/O, because
> I'm expecting the number of generated nodes will not fit in RAM and
> will have to be put on disk (and hashtables don't do very well with
> disk I/O due to the latency).

Have multiple levels of lookup.

For example, let's say your entire grid G is
1,000,000 x 1,000,000.

Have a top-level grid H that is 1000 x 1000. Each H[i, j]
corresponds to a 1000 x 1000 region of G. If any cell in that
region has been visited, H[i, j] points to a direct
representation of that 1000 x 1000 subregion of G; otherwise it
is null. Checking whether G[i, j] has been visited is as easy
as:

visited(i, j) = H[i/1000, j/1000] && (*H[i/1000, j/1000])[i%1000, j%1000];
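
For concreteness, here's a minimal D sketch of that two-level
structure (the names VisitedSet, visited and mark are mine; I
use one bit per cell via std.bitmanip.BitArray):

import std.bitmanip : BitArray;

enum N = 1000;  // region edge length

struct VisitedSet
{
    BitArray*[][] H;  // H[a][b] stays null until its region is touched

    this(size_t rows, size_t cols)
    {
        H = new BitArray*[][](rows, cols);
    }

    bool visited(size_t i, size_t j)
    {
        auto r = H[i / N][j / N];
        return r !is null && (*r)[(i % N) * N + (j % N)];
    }

    void mark(size_t i, size_t j)
    {
        auto r = H[i / N][j / N];
        if (r is null)
        {
            r = new BitArray;
            r.length = N * N;  // one bit per cell in the region
            H[i / N][j / N] = r;
        }
        (*r)[(i % N) * N + (j % N)] = true;
    }
}

Each allocated region costs N*N bits (about 125KB), so only the
regions the search actually touches take any memory.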

You can easily page those 1000 x 1000 subregions out to disk and
back in as needed, keeping just the active ones in memory. H
itself stays in memory (only 8MB of pointers). Obviously, you'd
be better off using a power of 2 instead of 1000, so the
divisions and modulos become cheap shifts and masks.
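
If you do switch to a power-of-2 tile size, the index math and
the page-out/page-in step might look something like this (a
rough sketch; the file-per-tile naming and the use of raw ulong
words for the bits are my own assumptions):

import std.file : exists, read, write;
import std.format : format;

enum tileShift = 10;                        // tile edge = 1 << 10 = 1024
enum tileEdge  = 1 << tileShift;
enum tileWords = tileEdge * tileEdge / 64;  // one bit per cell

// i / 1024 and i % 1024 become a shift and a mask:
size_t tileOf(size_t i) { return i >> tileShift; }
size_t within(size_t i) { return i & (tileEdge - 1); }

// Write an evicted tile's bits to disk.
void pageOut(const ulong[] bits, size_t a, size_t b)
{
    write(format("tile_%d_%d.bin", a, b), bits);
}

// Reload a tile, or start a fresh all-zero one if never saved.
ulong[] pageIn(size_t a, size_t b)
{
    auto name = format("tile_%d_%d.bin", a, b);
    return name.exists ? cast(ulong[])read(name)
                       : new ulong[tileWords];
}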

And of course, you can add as many levels as you like.
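
To illustrate the multi-level idea, here's a hedged sketch of an
arbitrary-depth variant as a 2-D trie (entirely my own
construction, just one way to generalize the above):

import std.bitmanip : BitArray;

enum B = 64;  // branching factor per level, per axis

struct Trie(int level)
{
    static if (level == 0)
    {
        BitArray bits;  // 64 x 64 leaf bitmap

        bool visited(size_t i, size_t j)
        {
            return bits.length && bits[(i % B) * B + (j % B)];
        }
    }
    else
    {
        enum span = B ^^ level;        // cells per child, per axis
        Trie!(level - 1)*[B][B] kids;  // null until a child is touched

        bool visited(size_t i, size_t j)
        {
            auto k = kids[i / span % B][j / span % B];
            return k !is null && k.visited(i, j);
        }
    }
}

Trie!3 covers 64^4 = ~16.8M cells per axis, more than enough for
the 1,000,000 x 1,000,000 example.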

