[OT] The horizon of a stream

Thu Oct 23 13:18:21 PDT 2008

Andrei Alexandrescu:
> At this point a discussion "ok, I'll store every line in a hashtable" 
> would have sufficed. I can tell straight away that this is not a 
> workable solution for a large file. It's actually happened to me :o|.

I understand, but I have 2 GB of RAM, so that solution (in Python) works most of the times. When you design an algorithm, you start with the simpler version (that's also easy to debug and test), and only when you experimentally see it doesn't suffice (too much slow, not enough RAM, etc) you try to improve it. When you debug the refined version you can also use the unittests developed with the basic version, this is very useful.

Another solution is to keep an associative array of just the string hash values, and lookup into the input file when you have a (possible) false positive.

Another solution that requires even less RAM is to use a bloom filter (I have created a bloom filter class in Python but not in D yet) to keep what strings are seen, and lookup into the input file (as before) to remove (possible) false positives.

Finally, looking up into the input file every time there's a match is too much slow, so you just store the candidate matches into a file or in RAM, to test them all in a second read of the file. If the bloom filter is defined well enough, then such possible matches aren't too many, and you can keep them in RAM.

Bye,
bearophile