[OT] Best algorithm for extremely large hashtable?
Peter Alexander
peter.alexander.au at gmail.com
Fri Nov 15 12:46:00 PST 2013
On Friday, 15 November 2013 at 20:22:29 UTC, H. S. Teoh wrote:
> The graph is not explicitly stored, because it is far too large. The
> nodes are generated on-the-fly as needed; the challenge is to know
> whether a particular generated node has been generated before.
> Basically, the equivalent of a hashtable of previously-visited nodes,
> but I'm hoping for (1) some kind of compressed representation so that
> as many nodes as possible can be kept in RAM for speed, and (2)
> something more efficient for disk I/O, because I'm expecting the
> number of generated nodes will not fit in RAM and will have to be put
> on disk (and hashtables don't do very well with disk I/O due to the
> latency).
Have multiple levels of lookup.
For example, let's say your entire grid G is 1,000,000 x 1,000,000.

Have a top-level grid H that is 1000 x 1000. Each H[i, j] corresponds to
a 1000 x 1000 region in G. If any cell in that region has been visited,
then H[i, j] points to a direct representation of that 1000 x 1000
subregion of G; if not, it is null. Checking whether G[i, j] has been
visited is as easy as:

visited(i, j) = H[i/1000, j/1000] && (*H[i/1000, j/1000])[i%1000, j%1000];

You can easily page those 1000 x 1000 subregions in and out to disk if
needed, keeping just the active ones in memory. H itself stays in memory
(only 8MB, at one pointer per subregion). Obviously, you'd be better off
using a power of 2 instead of 1000, for fast div and mod.
And of course, you can have as many levels to this as you like.
More information about the Digitalmars-d mailing list