[OT] shuffling lines in a stream
Sergey Gromov
snake.scaly at gmail.com
Fri Oct 10 08:09:26 PDT 2008
Fri, 10 Oct 2008 09:20:44 -0500,
Andrei Alexandrescu wrote:
> Sergey Gromov wrote:
> > Fri, 10 Oct 2008 07:55:38 -0500,
> > Andrei Alexandrescu wrote:
> >> Sergey Gromov wrote:
> >>> Well, my temporary file contains the whole stream. Because there is a
> >>> probability that the last line of the stream goes first.
> >>
> >> Yes, that probability exists, but that doesn't imply you must load the
> >> entire file in memory.
> >
> > This implies though that I cannot start output before I read all the
> > lines from the stream. Therefore I must store those lines somewhere.
>
> Creating temporary files is fine as long as you never need to store the
> entire input in memory.
Ok, let me re-formulate the task then, since the creation of a temporary
file containing all the lines seems inevitable. You basically want to
uniformly shuffle lines in a very large file.
The idea is this. Let N be number of lines in the file, which you know
after storing it. There are two passes, forward and backward. They are
symmetric. On a forward pass you read as many lines as you can fit in
memory, let it be m. Shuffle them. Then choose a number, k which is
0 < k < m and is somehow based upon the m and N, like k = m * m/N.
Write the k first lines back into the file and discard them from memory.
Append as many new lines to your memory array as possible. Shuffle.
Choose k. Write. Repeat. The reverse pass is exactly the same but you
start from the end of file.
The tricky part is how to choose k. It obviously needs to take number
of discarded lines, q, into account like in k = m * m/(N-q). But my
math-fu is almost non-existent so I'm unable to prove if this approach
can give an uniform distribution. I feel like it can though.
More information about the Digitalmars-d
mailing list