[OT] The horizon of a stream

BCS ao at pathlink.com
Thu Oct 23 14:10:48 PDT 2008


(going at this blind so I might dup others' ideas)

- walk the file line by line
- hash each line with a "good hash function"*
- record the line index and the start location of each line as a function of its hash

- look for the hash bucket with the largest spread of line indexes
- grab that spread
- page in the needed file segments
- test whether the lines really are the same; if so, return that spread
- else try again
-- using some kind of LRU cache, keep as many pages in RAM as you can
-- prefer testing stuff that is already in memory over stuff that is not.

The main issue will be the what-to-pick-next function. It must be fast, not 
require too much metadata, and work well with the cache.
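
As a rough D sketch of the bookkeeping pass only (hash each line, record its 
index and start offset per hash bucket, then find the bucket with the widest 
index spread), assuming the built-in hashOf stands in for the "good hash 
function" and using an ad-hoc Entry record; the paging, byte-for-byte compare, 
retry, and LRU parts are not shown:

// Sketch only: record (line index, byte offset) per hash bucket, then report
// the bucket with the widest index spread.  The two lines at those offsets
// still have to be paged in and compared byte for byte; on a mismatch the
// real algorithm falls back to other pairs / other buckets (not shown).
import std.stdio;

struct Entry
{
    size_t index;   // line number, starting at 0
    ulong  offset;  // byte offset of the line's start in the file
}

void main(string[] args)
{
    auto file = File(args[1], "r");   // file to scan, named on the command line

    Entry[][size_t] buckets;   // hash value -> every line that hashed to it
    size_t index = 0;
    ulong  offset = 0;

    // Pass 1: hash each line, remembering only its index and start offset.
    // The terminator is kept so the running byte offset stays easy to track.
    foreach (line; file.byLine(KeepTerminator.yes))
    {
        buckets[hashOf(line)] ~= Entry(index, offset);
        offset += line.length;
        ++index;
    }

    // Pick the bucket with the largest spread of line indexes.
    size_t bestSpread = 0;
    Entry first, last;
    foreach (entries; buckets.byValue)
    {
        immutable spread = entries[$ - 1].index - entries[0].index;
        if (spread > bestSpread)
        {
            bestSpread = spread;
            first = entries[0];
            last  = entries[$ - 1];
        }
    }

    writefln("candidate horizon %s: still need to compare the lines at offsets %s and %s",
             bestSpread, first.offset, last.offset);
}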



Another option:

load every line into a database table with its index:

-- assumed schema: lines(idx, line) holds one row per line of the file
-- ("index" is a reserved word in most SQL dialects, hence idx here),
-- and tmp(dif) holds one index spread per distinct line
INSERT INTO tmp SELECT MAX(idx) - MIN(idx) FROM lines GROUP BY line;
SELECT MAX(dif), COUNT(dif) INTO i, c FROM tmp;

if (c == 0)      // empty input: MAX() came back NULL
  return 0;
else
  return i;      // largest spread; 0 when no line repeats
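
And for comparison, a minimal D sketch of what that query computes, assuming 
the whole file fits in memory so the grouping can be done directly with an 
associative array (no hashing/verification pass needed):

// In-memory equivalent of the GROUP BY query: for each distinct line keep the
// index of its first occurrence and track the widest first-to-current spread.
import std.stdio;

void main(string[] args)
{
    auto file = File(args[1], "r");   // file to scan, named on the command line

    size_t[string] firstSeen;   // line content -> index of its first occurrence
    size_t horizon = 0;         // MAX(idx) - MIN(idx) over any repeated line
    size_t index = 0;

    foreach (line; file.byLineCopy())   // byLineCopy yields a fresh string per line
    {
        if (auto p = line in firstSeen)
        {
            if (index - *p > horizon)
                horizon = index - *p;
        }
        else
            firstSeen[line] = index;
        ++index;
    }

    writeln(horizon);   // 0 when no line repeats (or the file is empty)
}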




