[OT] The horizon of a stream
BCS
ao at pathlink.com
Thu Oct 23 14:10:48 PDT 2008
(going at this blind, so I might dup others' ideas)
- walk the file line by line
- hash each line with a "good hash function"*
- record the line index and the start location of each line as a function of its hash
- look for the hash bucket with the largest spread of line indexes
- grab that spread
- page in the needed file segments
- test if the lines are the same; if so, return
- else try again
-- using some kind of LRU cache, keep as many pages in RAM as you can
-- prefer testing stuff that is in memory over stuff that is not.
The main issue will be the what-to-pick-next function. It must be fast, not
require too much metadata, and work well with the cache.
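Roughly, in Python (names and details are mine, just a sketch of the shape: it
punts on the LRU/paging part by seeking straight back into the file, and it
only re-checks the extreme pair in each hash bucket, so a real version would
handle collisions more carefully):

from collections import defaultdict

def horizon(path):
    # Pass 1: bucket (line index, byte offset) pairs by line hash.
    buckets = defaultdict(list)
    with open(path, "rb") as f:
        offset = 0
        for idx, line in enumerate(f):
            buckets[hash(line)].append((idx, offset))
            offset += len(line)

    # Largest candidate spread per bucket: first vs. last entry
    # (entries are already in increasing line-index order).
    candidates = []
    for entries in buckets.values():
        if len(entries) > 1:
            i_min, off_a = entries[0]
            i_max, off_b = entries[-1]
            candidates.append((i_max - i_min, off_a, off_b))
    candidates.sort(reverse=True)

    # Pass 2: page in just the two lines and verify them, biggest spread
    # first; a hash collision means we fall through to the next candidate.
    with open(path, "rb") as f:
        for spread, off_a, off_b in candidates:
            f.seek(off_a)
            line_a = f.readline()
            f.seek(off_b)
            if line_a == f.readline():
                return spread
    return 0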
Another option:
load every line into a database table with its index:
INSERT INTO tmp (dif) SELECT MAX(index) - MIN(index) FROM lines GROUP BY line;
SELECT MAX(dif) INTO i, COUNT(dif) INTO c FROM tmp;
if (c == 0)
    return 0;
else
    return i;
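For the record, the same thing squeezed into one query with SQLite (table and
column names are mine; returns 0 for an empty file):

import sqlite3

def horizon_sql(path):
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE lines (idx INTEGER, line TEXT)")
    with open(path) as f:
        db.executemany("INSERT INTO lines VALUES (?, ?)", enumerate(f))
    # Spread of each distinct line; the horizon is the largest spread.
    (best,) = db.execute(
        "SELECT MAX(dif) FROM"
        " (SELECT MAX(idx) - MIN(idx) AS dif FROM lines GROUP BY line)"
    ).fetchone()
    return best if best is not None else 0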