[OT] The horizon of a stream

Thu Oct 23 14:04:40 PDT 2008

Thu, 23 Oct 2008 14:29:36 -0500,
Andrei Alexandrescu wrote:
> Consider some text stream. We define the horizon of the stream as the 
> longest number of lines between two identical lines minus one, or zero 
> if the stream consists of unique lines.
> 
> For example, this stream has horizon 9:
> 
> lorem
> ipsum
> dolor
> habeas
> corpus
> ipsum
> sit
> amet
> rigor
> mortis
> consectetur
> coitus
> interruptus
> corpus
> adipisicing
> elit
> 
> This is because the word "corpus" occurs on line 5 and line 14. This is 
> the farthest two lines in the file are; notice that "ipsum" also occurs 
> twice, but at a smaller distance.
> 
> The challenge, should you blah blah, is to compute the horizon of a 
> potentially large file. Of course theoretical and practical speed are as 
> important as always.
> 
> This is a task that's interesting and subtle in quite a number of ways, 
> and it allows for many creative solutions. Therefore I intentionally did 
> not restrict it much. Let ideas flow!

Traverse the stream, line by line.  If the word is new, add it to a 
dictionary, size_t[string], with the current line number.  If it's 
already in the dictionary, look at the distance to its first occurence 
and update horizon if that distance is greater than what we already have.  
If we don't run out of memory, we're done in one pass.

If we do run out of memory, recover ;) then continue without adding words 
into the dictionary but dumping them into a temporary file instead, just 
checking and updating the horizon.

The stream ends.  If our current horizon is greater than the number of 
lines we dumped into a temp file, we're done.  Otherwise, empty the 
dictionary, open the temp file, and repeat.