[OT] The horizon of a stream

Sun Oct 26 00:39:50 PDT 2008

Nigel Sandever:

>I did try that (using md5), but the penalty in Perl was horrible,<

This is a D newsgroup, so use D; it lets you manage bits more efficiently.

md5 is far too slow for this purpose; use the built-in hashing of strings.
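
A minimal sketch of what that could look like, assuming a current D2 compiler where hashOf is available from the implicitly imported object module. The names lineKey and lastSeen are made up for illustration, and the program only records where each line fingerprint was last seen; it is not a complete solution to the problem discussed in this thread.

```d
import std.stdio;

// Hypothetical helper: a machine-word fingerprint of one line,
// computed with D's built-in array hashing instead of md5.
size_t lineKey(const(char)[] line)
{
    return hashOf(line);
}

void main()
{
    // fingerprint -> number of the line where it was last seen
    size_t[size_t] lastSeen;
    size_t lineNo = 0;

    // byLine reuses its buffer, which is fine here because only
    // the hash of each line is stored, never the line itself.
    foreach (line; stdin.byLine())
    {
        lastSeen[lineKey(line)] = lineNo;
        ++lineNo;
    }

    writeln(lastSeen.length, " distinct fingerprints in ", lineNo, " lines");
}
```

Storing only a size_t per line keeps memory use small, but two different lines can share the same hash, so a match found this way is only a candidate and may need to be checked against the real text.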

>That said. For most applications of this, one would probably have some feel for the dataset(s) one is likely to deal with. And trading domain specific knowledge for total generality is often the best optimisation one can perform.<

I agree. When you have gigabytes of data, it's better to gain some knowledge about the data before processing it.

>But provided algorithms are written to assume files larger than memory, the numbers do scale linearly.<

Several of the provided algorithms don't assume files larger than memory, or they work much better when the file is smaller than memory.

>I used (a slightly modified version of) 2of12inf available from<

That's quite a complex file, so I suggest something simpler, such as this one, after cleaning out the non-ASCII words:
http://www.norvig.com/big.txt

Bye,
bearophile