[OT] The horizon of a stream

Thu Oct 23 13:28:02 PDT 2008

bearophile wrote:
> Andrei Alexandrescu:
>> At this point a discussion "ok, I'll store every line in a
>> hashtable" would have sufficed. I can tell straight away that this
>> is not a workable solution for a large file. It's actually happened
>> to me :o|.
> 
> I understand, but I have 2 GB of RAM, so that solution (in Python)
> works most of the times. When you design an algorithm, you start with
> the simpler version (that's also easy to debug and test), and only
> when you experimentally see it doesn't suffice (too much slow, not
> enough RAM, etc) you try to improve it. When you debug the refined
> version you can also use the unittests developed with the basic
> version, this is very useful.
> 
> Another solution is to keep an associative array of just the string
> hash values, and lookup into the input file when you have a
> (possible) false positive.

Now you're talking! This is a solution worth a lot of attention. Under
what circumstances it does well vs. not so well? How can you ensure it 
works for a stream of any size if multiple passes are possible? etc.

> Another solution that requires even less RAM is to use a bloom filter
> (I have created a bloom filter class in Python but not in D yet) to
> keep what strings are seen, and lookup into the input file (as
> before) to remove (possible) false positives.

Bloom filter is also an excellent idea!

Andrei