[OT] The horizon of a stream
Nigel Sandever
nigelsandever at btconnect.com
Sat Oct 25 20:30:33 PDT 2008
On Sat, 25 Oct 2008 22:52:51 -0400, bearophile <bearophileHUGS at lycos.com> wrote:
> Nigel Sandever:
> > For example, if you hash all lines beginning with 'a' (or 'A' or whatever is
> > applicable with your data) on the first pass, 'b' on the second pass, etc. At
> > the end of each pass you retain only the best horizon so far discovered and
> > discard the rest.
>
> A suggestion: you consider only lines that start with 'A'/'a', 'B'/'b', etc., to
> split the input data into bins. But you want to fill those bins uniformly, so
> how can you do it? The hash number, if it's a good hash, is nearly random, so
> you can partition its bits to partition your data into more uniform groups :-)
> The downside is that you have to compute the hash for each line many times. To
> speed this up you can hash just a sample of the chars of each line...
I did try that (using md5), but the penalty in Perl was horrible: the hash (AA)
just had to rehash the binary md5 anyway, so it was pure cost for my known
dataset.

For a real application with unknown/non-uniform data it would work like a charm
though ... assuming you don't hit that rarest of non-impossibilities: duplicate
md5s of non-identical inputs :)
That said, for most applications of this, one would probably have some feel for
the dataset(s) one is likely to deal with. And trading domain-specific knowledge
for total generality is often the best optimisation one can perform.
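
For what it's worth, the multi-pass, bin-on-hash-bits scheme looks roughly like
this. It's an untested sketch in D rather than the Perl I actually used; md5Of,
lineSplitter and the 4-bit bin count are just placeholders, and I'm taking the
horizon to be the largest distance between two identical lines:

// One pass per bin: only lines whose digest falls into the current bin are kept
// in the AA, so it stays small. In this sketch the text itself is simply held in
// memory; each pass could equally re-read the file from disk.
import std.digest.md : md5Of;
import std.file : readText;
import std.stdio : writeln;
import std.string : lineSplitter;

void main(string[] args)
{
    enum passes = 16;                       // bin on the top 4 bits of the digest
    auto text = readText(args[1]);
    size_t best = 0;
    foreach (pass; 0 .. passes)
    {
        size_t[string] first;               // line => index of its first occurrence
        size_t idx = 0;
        foreach (line; text.lineSplitter)
        {
            if ((md5Of(line)[0] >> 4) == pass)  // only this pass's bin
            {
                if (auto p = line in first)
                {
                    if (idx - *p > best)
                        best = idx - *p;    // keep only the best horizon so far
                }
                else
                    first[line] = idx;
            }
            ++idx;
        }
    }
    writeln("horizon: ", best);
}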
>
> Note that a 1 GB file is much too small for a true benchmark; I have 2 GB of RAM...
Agreed, and ditto. But provided the algorithms are written to assume files
larger than memory, the numbers do scale linearly.
You don't need to load the lines as strings; you can load the whole file as a
single memory block in RAM. Then you don't even need slices (8 bytes) to
represent a string, because each line's end point is either the end of the file
or the next newline, so you just need the starting offsets as keys and the line
indexes as values. Unfortunately the built-in AA with the built-in memory
allocator uses the same amount of memory whether your keys are 8 or 4 bytes
long. So if you find that memory isn't enough to run your program, you need a
less memory-hungry hash map. You can find plenty around, written in C; I have
translated one to D, it's easy. With that, if the lines aren't too short or
there are some duplicate lines, then the RAM (assuming 2 GB) suffices to keep
the whole associative array, so you need to load the file just one time.
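
A minimal sketch of that single-block approach, keying the built-in AA on slices
into the loaded buffer so nothing gets copied per line (the offsets-only trick
would need a custom hash map on top of this, which isn't shown):

import std.file : read;
import std.stdio : writeln;

void main(string[] args)
{
    // Whole file as one block; the cast is fine as long as we never mutate it.
    auto data = cast(string) read(args[1]);
    size_t[string] first;           // slice into data => index of first occurrence
    size_t best = 0, idx = 0, start = 0;
    foreach (i, c; data)
    {
        if (c == '\n')              // a final line with no trailing newline is ignored here
        {
            auto line = data[start .. i];
            if (auto p = line in first)
            {
                if (idx - *p > best)
                    best = idx - *p;
            }
            else
                first[line] = idx;
            ++idx;
            start = i + 1;
        }
    }
    writeln("horizon: ", best);
}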
I agree that a 'testset producer' and a freely accessible known dictionary are a
better option.
I used (a slightly modified version of) 2of12inf available from
http://downloads.sourceforge.net/wordlist/alt12dicts-4.tar.gz
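
Something along these lines would do as a throwaway test-set producer
(hypothetical, not the script I used). It writes one random dictionary word per
line, so duplicate lines (and hence a non-trivial horizon) show up as soon as
the line count exceeds the word count:

import std.conv : to;
import std.file : readText;
import std.random : uniform;
import std.stdio : File;
import std.string : splitLines;

// Usage: gen <wordlist> <outfile> <number-of-lines>
void main(string[] args)
{
    auto words = readText(args[1]).splitLines;
    auto outf = File(args[2], "w");
    foreach (n; 0 .. args[3].to!size_t)
        outf.writeln(words[uniform(0, words.length)]);
}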
>
> Bye,
> bearophile