[OT] The horizon of a stream
Nigel Sandever
nigelsandever at btconnect.com
Sat Oct 25 20:30:33 PDT 2008
On Sat, 25 Oct 2008 22:52:51 -0400, bearophile <bearophileHUGS at lycos.com> wrote:
> Nigel Sandever:
> > For example, if you hash all lines beginning with 'a' (or 'A' or whatever is
> > applicable with your data) on the first pass, 'b' on the second pass, etc. At
> > the end of each pass you retain only the best horizon so far discovered and
> > discard the rest.
>
> A suggestion: you consider only lines that start with 'A'/'a', 'B'/'b', etc., to
> split the input data into bins. But you want to fill those bins uniformly, so
> how can you do it? The hash number, if it's a good hash, is nearly random, so
> you can partition its bits to partition your data into more uniform groups :-)
> The downside is that you have to compute the hash for each line many times. To
> speed this up you can hash just a sample of the chars of each line...
I did try that (using md5), but the penalty in Perl was horrible: the hash (AA)
just had to rehash the binary md5 anyway, so it was pure cost for my known
dataset.

For a real application with unknown/non-uniform data it would work like a charm
though ... assuming you don't hit that rarest of non-impossibilities: duplicate
md5s of non-identical inputs :)
That said, for most applications of this, one would probably have some feel for
the dataset(s) one is likely to deal with. And trading domain-specific knowledge
for total generality is often the best optimisation one can perform.
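
For what it's worth, the multi-pass, bin-on-hash-bits scheme looks roughly like
this. It's an untested sketch in D rather than the Perl I actually used; md5Of,
lineSplitter and the 4-bit bin count are just placeholders, and I'm taking the
horizon to be the largest distance between two identical lines:

// One pass per bin: only lines whose digest falls into the current bin are kept
// in the AA, so it stays small. In this sketch the text itself is simply held in
// memory; each pass could equally re-read the file from disk.
import std.digest.md : md5Of;
import std.file : readText;
import std.stdio : writeln;
import std.string : lineSplitter;

void main(string[] args)
{
    enum passes = 16;                       // bin on the top 4 bits of the digest
    auto text = readText(args[1]);
    size_t best = 0;
    foreach (pass; 0 .. passes)
    {
        size_t[string] first;               // line => index of its first occurrence
        size_t idx = 0;
        foreach (line; text.lineSplitter)
        {
            if ((md5Of(line)[0] >> 4) == pass)  // only this pass's bin
            {
                if (auto p = line in first)
                {
                    if (idx - *p > best)
                        best = idx - *p;    // keep only the best horizon so far
                }
                else
                    first[line] = idx;
            }
            ++idx;
        }
    }
    writeln("horizon: ", best);
}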
>
> Note that a 1 GB file is much too small for a true benchmark; I have 2 GB of RAM...
Agreed, and ditto. But provided the algorithms are written to assume files
larger than memory, the numbers do scale linearly.
You don't need to load the lines as strings; you can load the whole file as a
single memory block in RAM. Then you don't even need slices (8 bytes) to
represent a string, because each line's end point is either the end of the file
or the next newline, so you just need the starting offsets as keys and the line
indexes as values. Unfortunately the built-in AA with the built-in memory
allocator uses the same amount of memory whether your keys are 8 or 4 bytes
long. So if you find that memory isn't enough to run your program, you need a
less memory-hungry hash map. You can find plenty around, written in C; I have
translated one to D, it's easy. With that, if the lines aren't too short or
there are some duplicate lines, then the RAM (assuming 2 GB) suffices to keep
the whole associative array, so you need to load the file just one time.
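
A minimal sketch of that single-block approach, keying the built-in AA on slices
into the loaded buffer so nothing gets copied per line (the offsets-only trick
would need a custom hash map on top of this, which isn't shown):

import std.file : read;
import std.stdio : writeln;

void main(string[] args)
{
    // Whole file as one block; the cast is fine as long as we never mutate it.
    auto data = cast(string) read(args[1]);
    size_t[string] first;           // slice into data => index of first occurrence
    size_t best = 0, idx = 0, start = 0;
    foreach (i, c; data)
    {
        if (c == '\n')              // a final line with no trailing newline is ignored here
        {
            auto line = data[start .. i];
            if (auto p = line in first)
            {
                if (idx - *p > best)
                    best = idx - *p;
            }
            else
                first[line] = idx;
            ++idx;
            start = i + 1;
        }
    }
    writeln("horizon: ", best);
}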
I agree that a 'testset producer' and a freely accessible known dictionary are a
better option.
I used (a slightly modified version of) 2of12inf available from
http://downloads.sourceforge.net/wordlist/alt12dicts-4.tar.gz
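
Something along these lines would do as a throwaway test-set producer
(hypothetical, not the script I used). It writes one random dictionary word per
line, so duplicate lines (and hence a non-trivial horizon) show up as soon as
the line count exceeds the word count:

import std.conv : to;
import std.file : readText;
import std.random : uniform;
import std.stdio : File;
import std.string : splitLines;

// Usage: gen <wordlist> <outfile> <number-of-lines>
void main(string[] args)
{
    auto words = readText(args[1]).splitLines;
    auto outf = File(args[2], "w");
    foreach (n; 0 .. args[3].to!size_t)
        outf.writeln(words[uniform(0, words.length)]);
}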
>
> Bye,
> bearophile