[OT] Best algorithm for extremely large hashtable?
Ivan Kazmenko
gassa at mail.ru
Fri Nov 15 12:51:29 PST 2013
On Friday, 15 November 2013 at 20:07:24 UTC, H. S. Teoh wrote:
> On Fri, Nov 15, 2013 at 08:48:19PM +0100, John Colvin wrote:
>> How is this the same as the edges of an n-dimensional grid?
>
> Basically, adjacent nodes differ only in a single coordinate, and
> that difference can only be 1 or -1. So for the 2D case, if you
> graph the nodes on the plane, the edges would be horizontal/vertical
> line segments of unit length. If you consider the 2D grid's edges to
> be *all* possible such line segments, then in that sense you could
> think of the graph's edges as being a subset of the 2D grid's.
>
> I was hoping that this fact can be taken advantage of, to make a
> compact representation of visited nodes.
How dense is the graph?
For example, if it contains every possible edge of the kind
described (+-1 in any single coordinate), then for a breadth-first
search we can get away with keeping track of just a single integer:
the farthest distance traveled so far. After processing BFS layer d,
the visited set is exactly the set of nodes within Manhattan
distance d of the start.
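A rough sketch in D of that membership test (the function name and
the int[] node representation are my own assumptions, not something
from the thread):

import std.math : abs;

// Fully dense case: "have we visited this node already?" reduces to
// comparing its Manhattan distance from the BFS start with the
// distance of the current frontier.
bool alreadyVisited(const int[] node, const int[] start, int frontierDist)
{
    int dist = 0;
    foreach (i; 0 .. node.length)
        dist += abs(node[i] - start[i]);
    return dist < frontierDist;
}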
For very dense graphs, you can perhaps apply a similar idea:
represent large visited areas as "diamonds" given by a center and a
radius, then try to "factor" the visited area into diamonds on the
fly. Possibly they will be "diamonds with a [much shorter] list of
holes". For example, we say that all nodes at distance at most 3
from (0, 0, ..., 0) are visited, and store a single center (the
origin) and a radius (3) instead of all of them (roughly
dimensions^radius, here on the order of 100^3 nodes). A lookup is
then at most O (number-of-diamonds) plus a hash table lookup for the
individual nodes. Perhaps for certain graphs, we can find a good
compromise between the number of diamonds and the number of
individually stored nodes.
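A minimal sketch of what such a combined structure might look like
in D (the Diamond and VisitedSet names are made up here, and the
actual "factoring" of nodes into diamonds is left out):

import std.math : abs;

struct Diamond
{
    int[] center;
    int radius;   // Manhattan radius
}

struct VisitedSet
{
    Diamond[] diamonds;                 // large "factored" regions
    bool[immutable(int)[]] individual;  // hash table for leftover nodes

    bool contains(const int[] node)
    {
        // O(number-of-diamonds) scan...
        foreach (d; diamonds)
        {
            int dist = 0;
            foreach (i; 0 .. node.length)
                dist += abs(node[i] - d.center[i]);
            if (dist <= d.radius)
                return true;
        }
        // ...plus one hash table lookup for the individual nodes.
        return (node.idup in individual) !is null;
    }

    void add(const int[] node)
    {
        individual[node.idup] = true;
        // Factoring freshly added nodes into diamonds (and handling
        // the "holes") would go here; that is the hard part.
    }
}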
The above is just an idea; perhaps it won't be feasible by itself
once you get to the details, but it may inspire some related
optimization.
Also, the border of an n-dimensional area is an (n-1)-dimensional
object, so for dense enough graphs, you can hope that the number of
elements in it is smaller by an order of magnitude than the total
number of visited nodes. However, for too many dimensions (as in
your case, 10^100 -> 10^99), that does not seem to help much.
Another question is: when will the graph traversal end? For example,
if you are certain that you won't need to visit more than, say, one
million nodes, a simple hash table storing the node representations
at the hash indices will suffice. On the other hand, if you plan to
visit 10^12 nodes, and the graph is neither very sparse nor very
dense (and not regular in any obvious way beyond what is described),
perhaps you won't get the required compression level (1/1000)
anyway.
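By "a simple hash table" I mean nothing fancier than D's built-in
associative array keyed on the serialized coordinates; a toy sketch,
assuming nodes are plain int arrays:

void main()
{
    bool[immutable(int)[]] visited;

    auto node = new int[100];     // a 100-dimensional node, all zeros
    visited[node.idup] = true;    // mark the start as visited

    node[42] = 1;                 // a neighbour: +1 in one coordinate
    if (node.idup !in visited)
        visited[node.idup] = true;
}

At one million visited nodes, that is roughly 400 MB of raw key data
(10^6 nodes x 100 ints x 4 bytes) plus hash table overhead, which is
still manageable.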
Ivan Kazmenko.