[OT] The horizon of a stream

bearophile bearophileHUGS at lycos.com
Sat Oct 25 19:52:51 PDT 2008


Nigel Sandever:
> For example, you hash all lines beginning with 'a' (or 'A', or whatever is
> applicable to your data) on the first pass, 'b' on the second pass, etc. At the end of each
> pass you retain only the best horizon so far discovered and discard the rest.

A suggestion: you consider only lines that start with 'A'/'a', 'B'/'b', etc., to split the input data into bins. But you want to fill those bins uniformly, so how can you do it? A good hash value is nearly random, so you can partition its bits to split your data into much more uniform groups :-) The downside is that the hash of each line gets computed once per pass. To speed this up you can hash just a sample of the chars of each line...
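A rough D sketch of the idea (the file name, the number of bins and the sampling hash are just placeholders, not a real implementation):

import std.stdio;

// FNV-1a hash over a sample of the line's chars (the step is arbitrary).
uint sampleHash(const(char)[] line, size_t step = 4) {
    uint h = 2166136261u;
    for (size_t i = 0; i < line.length; i += step) {
        h ^= line[i];
        h *= 16777619u;
    }
    return h;
}

void main() {
    enum nBins = 16;  // use the low 4 bits of the hash => 16 passes
    foreach (pass; 0 .. nBins) {
        bool[string] seen;  // per-pass set, roughly 1/16 of the lines land here
        foreach (line; File("data.txt").byLine()) {
            if ((sampleHash(line) & (nBins - 1)) != pass)
                continue;  // this line belongs to another pass
            seen[line.idup] = true;  // ...do the real per-pass work here...
        }
        // keep only the best horizon found so far, then discard `seen`
    }
}

Each pass then touches only about 1/16 of the lines, and the bins are far more uniform than first-letter bins.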

Note that a 1 GB file is too small for a true benchmark, I have 2 GB of RAM... You don't need to load the lines as separate strings: you can load the whole file as a single memory block in RAM. Then you don't even need slices (8 bytes) to represent a string, because each line ends at the next newline or at the end of the file, so you just need the starting offsets as keys and the line indexes as values. Unfortunately the built-in AA with the built-in memory allocator uses the same amount of memory whether your keys are 8 or 4 bytes long. So if you see that memory isn't enough to run your program, you need to use a less memory-hungry hash map. You can find plenty around, written in C; I have translated one to D, it's easy. With that, if the lines aren't too short or there are some duplicate lines, the RAM (assuming 2 GB) suffices to keep the whole associative array, so you need to load the file just one time.
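A minimal sketch of that single-block layout (the file name is again a placeholder; on a 32-bit build the uint keys are 4 bytes each):

import std.file : read;

void main() {
    // Read the whole file into one contiguous block; no per-line string copies.
    auto data = cast(const(char)[]) read("data.txt");  // file name is just an example

    // Key: starting offset of a line, value: line index.
    // The end of each line is implicit: the next newline, or the end of `data`.
    size_t[uint] lineIndexOfStart;
    uint start = 0;
    size_t lineIndex = 0;
    foreach (i, c; data) {
        if (c == '\n') {
            lineIndexOfStart[start] = lineIndex++;
            start = cast(uint)(i + 1);
        }
    }
    if (start < data.length)  // last line without a trailing newline
        lineIndexOfStart[start] = lineIndex;

    // A line's text is simply data[start .. the next newline (or $)].
}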

Bye,
bearophile


