[OT] Best algorithm for extremely large hashtable?

H. S. Teoh hsteoh at quickfur.ath.cx
Fri Nov 15 12:32:45 PST 2013


On Fri, Nov 15, 2013 at 08:29:22PM +0100, Vladimir Panteleev wrote:
> On Friday, 15 November 2013 at 18:43:12 UTC, H. S. Teoh wrote:
[...]
> >I'm trying to implement an algorithm that traverses a very large
> >graph, and I need some kind of data structure to keep track of which
> >nodes have been visited, that (1) allows reasonably fast lookups
> >(preferably O(1)), and (2) doesn't require GB's of storage (i.e.,
> >some kind of compression would be nice).
> 
> A while ago I set out to write a solver for a group of problems
> which can be described as breadth-first traversal of extremely
> large implicit graphs. Some code here (C++):
> https://github.com/CyberShadow/DDD. The project uses delayed
> duplicate detection to allow the number of nodes to exceed
> available RAM.
> 
> What we found was that in certain cases, delayed duplicate
> detection beat hash tables even while filtering duplicates in
> memory; I think this is mainly because DDD is more parallelizable
> than hash tables.
> 
> You mentioned compression - perhaps a bloom filter will fit your
> needs, as an optimization?

Thanks!! I think delayed duplicate detection is what I'm looking for.
The bloom filter idea is an interesting one; I'll definitely consider it
as a later optimization once I get DDD working.
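Roughly what I have in mind for the DDD step is something like the
sketch below (in D, assuming nodes can be encoded as ulong keys and
the visited list is kept sorted; for a graph that doesn't fit in RAM,
the sort would become an external merge sort over on-disk runs):

```d
// Sketch of one expansion step with delayed duplicate detection.
import std.algorithm : sort, uniq, setDifference;
import std.array : array;

ulong[] expandFrontier(ulong[] frontier, ulong[] visited,
                       ulong[] delegate(ulong) successors)
{
    // 1. Generate all successors first, *without* checking for duplicates.
    ulong[] candidates;
    foreach (node; frontier)
        candidates ~= successors(node);

    // 2. Delayed duplicate detection: sort the whole batch once, then drop
    //    duplicates within the batch and anything already in `visited`
    //    (which is assumed to be sorted).
    candidates.sort();
    return setDifference(candidates.uniq, visited).array;
}
```

The expensive dedup work then happens once per batch rather than once
per generated node, which I guess is also what makes it easier to
parallelize than a shared hash table.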
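For the bloom filter, the minimal version I'd try first looks
something like this (the hash-mixing constants and the byte-per-slot
bool[] are just placeholders; a real version would size the table and
hash count from the expected node count and use std.bitmanip.BitArray
to get one bit per slot):

```d
// Sketch of a Bloom filter used as a pre-filter for visited checks.
struct BloomFilter
{
    private bool[] bits;

    this(size_t nbits) { bits = new bool[nbits]; }

    private size_t hash1(ulong key) const
    {
        return cast(size_t)((key * 0x9E3779B97F4A7C15UL) % bits.length);
    }

    private size_t hash2(ulong key) const
    {
        return cast(size_t)((key ^ (key >> 29)) * 0xC2B2AE3D27D4EB4FUL % bits.length);
    }

    void insert(ulong key)
    {
        bits[hash1(key)] = true;
        bits[hash2(key)] = true;
    }

    // false => definitely not visited; true => possibly visited, so the
    // exact (slower) check against the real visited set is still needed.
    bool maybeContains(ulong key) const
    {
        return bits[hash1(key)] && bits[hash2(key)];
    }
}
```

A "false" answer means the node is definitely new, so the expensive
exact check would only run for the (hopefully rare) "true" answers.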


T

-- 
It won't be covered in the book. The source code has to be useful for something, after all. -- Larry Wall

