[OT] The horizon of a stream

Nigel Sandever nigelsandever at btconnect.com
Sat Oct 25 18:00:59 PDT 2008


On Thu, 23 Oct 2008 14:29:36 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail at erdani.org> wrote:
> Consider some text stream. We define the horizon of the stream as the 
> longest number of lines between two identical lines minus one, or zero 
> if the stream consists of unique lines.
> 
> For example, this stream has horizon 9:
> 
> lorem
> ipsum
> dolor
> habeas
> corpus
> ipsum
> sit
> amet
> rigor
> mortis
> consectetur
> coitus
> interruptus
> corpus
> adipisicing
> elit
> 
> This is because the word "corpus" occurs on line 5 and line 14. This is 
> the farthest two lines in the file are; notice that "ipsum" also occurs 
> twice, but at a smaller distance.
> 
> The challenge, should you blah blah, is to compute the horizon of a 
> potentially large file. Of course theoretical and practical speed are as 
> important as always.
> 
> This is a task that's interesting and subtle in quite a number of ways, 
> and it allows for many creative solutions. Therefore I intentionally did 
> not restrict it much. Let ideas flow!
> 
> 
> Andrei

The hash solution is simple and fast, but uses too much memory. So, sub-divide 
your data. 
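For reference, the sort of thing I mean looks roughly like this in D (my own
illustration, not anything from the original post): keep a table mapping each
line to the line number of its first occurrence, and track the largest gap as
you read.

    // usage: rdmd horizon.d <file>
    import std.algorithm : max;
    import std.stdio;

    void main(string[] args)
    {
        size_t[string] firstSeen;   // line text -> line number of first occurrence
        size_t horizon = 0;
        size_t lineNo  = 0;

        foreach (line; File(args[1]).byLine())
        {
            ++lineNo;
            auto key = line.idup;                // byLine reuses its buffer, so copy
            if (auto first = key in firstSeen)
                horizon = max(horizon, lineNo - *first);
            else
                firstSeen[key] = lineNo;
        }
        writeln(horizon);                        // 0 when every line is unique
    }

The trouble is that firstSeen ends up holding every distinct line at once, which
is where the memory goes.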

Obviously, you cannot just process the first half and then the second, because a 
repeated line could straddle the split. But you can make multiple complete passes 
over the file and only consider a subset of the lines during each pass.

For example, on the first pass hash only the lines beginning with 'a' (or 'A', or 
whatever is applicable to your data), on the second pass those beginning with 'b', 
and so on. At the end of each pass you retain only the best horizon discovered so 
far and discard the rest.
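A sketch of that pass structure (again just my reading of the idea, assuming
lower-case word-per-line input; note the line counter keeps counting every line
on every pass, so distances stay in absolute line numbers):

    import std.algorithm : max;
    import std.stdio;

    void main(string[] args)
    {
        size_t horizon = 0;

        foreach (int bucket; 'a' .. 'z' + 1)          // one complete pass per letter
        {
            size_t[string] firstSeen;
            size_t lineNo = 0;
            foreach (line; File(args[1]).byLine())
            {
                ++lineNo;                             // count every line, hashed or not
                if (line.length == 0 || line[0] != bucket)
                    continue;                         // not this pass's subset
                auto key = line.idup;
                if (auto first = key in firstSeen)
                    horizon = max(horizon, lineNo - *first);
                else
                    firstSeen[key] = lineNo;
            }
            // firstSeen is discarded here; only the running horizon survives
        }
        writeln(horizon);
    }

Duplicate lines necessarily start with the same character, so every repeated pair
lands entirely within one pass.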

Assuming all lower-case and a roughly even distribution, you reduce your memory 
requirement to roughly 1/26th of the single-pass approach, i.e. ~4%.

By way of example, I knocked up a data file consisting of 100+ million lines 
(1GB) of words chosen randomly from a dictionary. Using Perl and 26 passes, I got 
my result in just under 25 minutes using < 2MB of memory.

Admittedly, with this size of file, the entire thing was in the file-system cache 
for the second and subsequent passes, which saved considerable time. If your file 
is much bigger, that probably wouldn't be the case. But if you're going to write 
this in D, you can probably speed things up a bit anyway.

wc -l with a cold cache manages a throughput of 28MB/sec on my system, whereas 
Perl only achieved a best of 60MB/sec with a warm cache.

I seem to recall seeing a memory-mapped file module in the D library somewhere. 
Maybe that would reduce the IO for multiple passes?
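If memory serves, that module is std.mmfile. A minimal sketch of what using it
might look like (hedged: I'm only assuming the MmFile slice interface here; each
pass would walk the same slice instead of re-reading through File.byLine):

    import std.algorithm : splitter;
    import std.mmfile;
    import std.stdio;

    void main(string[] args)
    {
        auto mmf  = new MmFile(args[1]);           // read-only mapping of the file
        auto text = cast(const(char)[]) mmf[];     // the whole file as one slice

        // A pass would iterate text.splitter('\n') here; hot pages stay in the
        // OS cache, so later passes cost little extra I/O.
        size_t lines = 0;
        foreach (line; text.splitter('\n'))
            ++lines;
        writeln(lines, " lines mapped");
    }

A nice side effect is that the hash keys could be slices into the mapping rather
than copies, avoiding a per-line idup.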

Also, given the minimal 2MB maximum memory usage above, if you know (or can probe) 
your data, then you can reduce the number of passes by hashing (say) a-f on the 
first pass, g-l on the second, and so on.

Or, if all your data begins with (say) "Line ...", then key off the appropriate 
position in the line to form your pass split. Or, if your file is really huge, 
use the first two characters for a 676-way split. (And wait :)
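A sketch of such a bucket key (a hypothetical helper of my own, assuming every
line starts with at least two lower-case letters; the pass loop would then run
foreach (bucket; 0 .. 676) and hash only the lines whose key matches):

    // Hypothetical 676-way bucket key taken from the first two characters.
    int bucketOf(const(char)[] line)
    {
        // Assumes at least two lower-case letters per line; real data needs guards.
        return (line[0] - 'a') * 26 + (line[1] - 'a');
    }

    void main()
    {
        import std.stdio : writeln;
        writeln(bucketOf("corpus"));    // 'c','o' -> 2*26 + 14 = 66
    }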

Anyway, just a thought from a drive-by reader.
