[OT] shuffling lines in a stream

Sergey Gromov snake.scaly at gmail.com
Fri Oct 10 08:09:26 PDT 2008


Fri, 10 Oct 2008 09:20:44 -0500,
Andrei Alexandrescu wrote:
> Sergey Gromov wrote:
> > Fri, 10 Oct 2008 07:55:38 -0500,
> > Andrei Alexandrescu wrote:
> >> Sergey Gromov wrote:
> >>> Well, my temporary file contains the whole stream.  Because there is a 
> >>> probability that the last line of the stream goes first.
> >>
> >> Yes, that probability exists, but that doesn't imply you must load the 
> >> entire file in memory.
> > 
> > This implies though that I cannot start output before I read all the 
> > lines from the stream.  Therefore I must store those lines somewhere.
> 
> Creating temporary files is fine as long as you never need to store the 
> entire input in memory.

Ok, let me re-formulate the task then, since the creation of a temporary 
file containing all the lines seems inevitable.  You basically want to 
uniformly shuffle lines in a very large file.

The idea is this.  Let N be number of lines in the file, which you know 
after storing it.  There are two passes, forward and backward.  They are 
symmetric.  On a forward pass you read as many lines as you can fit in 
memory, let it be m.  Shuffle them.  Then choose a number, k which is
0 < k < m and is somehow based upon the m and N, like k = m * m/N.  
Write the k first lines back into the file and discard them from memory.  
Append as many new lines to your memory array as possible.  Shuffle.  
Choose k.  Write.  Repeat.  The reverse pass is exactly the same but you 
start from the end of file.

The tricky part is how to choose k.  It obviously needs to take number 
of discarded lines, q, into account like in k = m * m/(N-q).  But my 
math-fu is almost non-existent so I'm unable to prove if this approach 
can give an uniform distribution.  I feel like it can though.



More information about the Digitalmars-d mailing list