[OT] Best algorithm for extremely large hashtable?
Ivan Kazmenko
gassa at mail.ru
Fri Nov 15 12:51:29 PST 2013
On Friday, 15 November 2013 at 20:07:24 UTC, H. S. Teoh wrote:
> On Fri, Nov 15, 2013 at 08:48:19PM +0100, John Colvin wrote:
>> How is this the same as the edges of an n-dimensional grid?
>
> Basically, adjacent nodes differ only in a single coordinate, and
> that difference can only be 1 or -1. So for the 2D case, if you
> graph the nodes on the plane, the edges would be horizontal/vertical
> line segments of unit length. If you consider the 2D grid's edges to
> be *all* possible such line segments, then in that sense you could
> think of the graph's edges as being a subset of the 2D grid's.
>
> I was hoping that this fact can be taken advantage of, to make a
> compact representation of visited nodes.
How dense is the graph?
For example, if it contains every possible edge of the kind
described (+-1 in any single coordinate), then for a breadth-first
search we can get away with keeping track of just a single integer:
the farthest distance traveled so far. After processing BFS layer d,
the visited set is exactly the set of nodes within Manhattan
distance d of the start.
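A rough sketch in D of that membership test (the function name and
the int[] node representation are my own assumptions, not something
from the thread):

import std.math : abs;

// Fully dense case: "have we visited this node already?" reduces to
// comparing its Manhattan distance from the BFS start with the
// distance of the current frontier.
bool alreadyVisited(const int[] node, const int[] start, int frontierDist)
{
    int dist = 0;
    foreach (i; 0 .. node.length)
        dist += abs(node[i] - start[i]);
    return dist < frontierDist;
}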
For very dense graphs, you can perhaps apply a similar idea:
represent large visited areas as "diamonds" given by a center and a
radius, then try to "factor" the visited area into diamonds on the
fly. Possibly they will be "diamonds with a [much shorter] list of
holes". For example, we say that all nodes at distance at most 3
from (0, 0, ..., 0) are visited, and store a single center (the
origin) and a radius (3) instead of all of them (roughly
dimensions^radius, here on the order of 100^3 nodes). A lookup is
then at most O (number-of-diamonds) plus a hash table lookup for the
individual nodes. Perhaps for certain graphs, we can find a good
compromise between the number of diamonds and the number of
individually stored nodes.
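A minimal sketch of what such a combined structure might look like
in D (the Diamond and VisitedSet names are made up here, and the
actual "factoring" of nodes into diamonds is left out):

import std.math : abs;

struct Diamond
{
    int[] center;
    int radius;   // Manhattan radius
}

struct VisitedSet
{
    Diamond[] diamonds;                 // large "factored" regions
    bool[immutable(int)[]] individual;  // hash table for leftover nodes

    bool contains(const int[] node)
    {
        // O(number-of-diamonds) scan...
        foreach (d; diamonds)
        {
            int dist = 0;
            foreach (i; 0 .. node.length)
                dist += abs(node[i] - d.center[i]);
            if (dist <= d.radius)
                return true;
        }
        // ...plus one hash table lookup for the individual nodes.
        return (node.idup in individual) !is null;
    }

    void add(const int[] node)
    {
        individual[node.idup] = true;
        // Factoring freshly added nodes into diamonds (and handling
        // the "holes") would go here; that is the hard part.
    }
}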
The above is just an idea; perhaps it won't be feasible by itself
once you get to the details, but it may inspire some related
optimization.
Also, the border of an n-dimensional area is an (n-1)-dimensional
object, so for dense enough graphs, you can hope that the number of
elements in it is smaller by an order of magnitude than the total
number of visited nodes. However, for too many dimensions (as in
your case, 10^100 -> 10^99), that does not seem to help much.
Another question is: when will the graph traversal end? For example,
if you are certain that you won't need to visit more than, say, one
million nodes, a simple hash table storing the node representations
at the hash indices will suffice. On the other hand, if you plan to
visit 10^12 nodes, and the graph is neither very sparse nor very
dense (and not regular in any obvious way beyond what is described),
perhaps you won't get the required compression level (1/1000)
anyway.
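By "a simple hash table" I mean nothing fancier than D's built-in
associative array keyed on the serialized coordinates; a toy sketch,
assuming nodes are plain int arrays:

void main()
{
    bool[immutable(int)[]] visited;

    auto node = new int[100];     // a 100-dimensional node, all zeros
    visited[node.idup] = true;    // mark the start as visited

    node[42] = 1;                 // a neighbour: +1 in one coordinate
    if (node.idup !in visited)
        visited[node.idup] = true;
}

At one million visited nodes, that is roughly 400 MB of raw key data
(10^6 nodes x 100 ints x 4 bytes) plus hash table overhead, which is
still manageable.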
Ivan Kazmenko.