[OT] The horizon of a stream

Jason House jason.james.house at gmail.com
Thu Oct 23 19:38:25 PDT 2008


By hash value, store a linked list of (line #, char offset within file)

When a hash collision occurs, append to the list, but also calculate the
horizon from the head of the list.

While the horizon is higher than previously observed, scan the list until a
string match is found and update the maximum observed horizon.


This performance is linear, but has extra string lookups over some
alternatives. It's possible to optimize the string comparisons with some
kind of temporary caching of strings such as weak pointers.



Andrei Alexandrescu wrote:

> Consider some text stream. We define the horizon of the stream as the
> longest number of lines between two identical lines minus one, or zero
> if the stream consists of unique lines.
> 
> For example, this stream has horizon 9:
> 
> lorem
> ipsum
> dolor
> habeas
> corpus
> ipsum
> sit
> amet
> rigor
> mortis
> consectetur
> coitus
> interruptus
> corpus
> adipisicing
> elit
> 
> This is because the word "corpus" occurs on line 5 and line 14. This is
> the farthest two lines in the file are; notice that "ipsum" also occurs
> twice, but at a smaller distance.
> 
> The challenge, should you blah blah, is to compute the horizon of a
> potentially large file. Of course theoretical and practical speed are as
> important as always.
> 
> This is a task that's interesting and subtle in quite a number of ways,
> and it allows for many creative solutions. Therefore I intentionally did
> not restrict it much. Let ideas flow!
> 
> 
> Andrei




More information about the Digitalmars-d mailing list