[OT] shuffling lines in a stream

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Fri Oct 10 08:12:33 PDT 2008


Sergey Gromov wrote:
> Fri, 10 Oct 2008 09:20:44 -0500,
> Andrei Alexandrescu wrote:
>> Sergey Gromov wrote:
>>> Fri, 10 Oct 2008 07:55:38 -0500,
>>> Andrei Alexandrescu wrote:
>>>> Sergey Gromov wrote:
>>>>> Well, my temporary file contains the whole stream.  Because there is a 
>>>>> probability that the last line of the stream goes first.
>>>> Yes, that probability exists, but that doesn't imply you must load the 
>>>> entire file in memory.
>>> This implies though that I cannot start output before I read all the 
>>> lines from the stream.  Therefore I must store those lines somewhere.
>> Creating temporary files is fine as long as you never need to store the 
>> entire input in memory.
> 
> Ok, let me re-formulate the task then, since the creation of a temporary 
> file containing all the lines seems inevitable.  You basically want to 
> uniformly shuffle lines in a very large file.
> 
> The idea is this.  Let N be number of lines in the file, which you know 
> after storing it.  There are two passes, forward and backward.  They are 
> symmetric.  On a forward pass you read as many lines as you can fit in 
> memory, let it be m.  Shuffle them.  Then choose a number, k which is
> 0 < k < m and is somehow based upon the m and N, like k = m * m/N.  
> Write the k first lines back into the file and discard them from memory.  
> Append as many new lines to your memory array as possible.  Shuffle.  
> Choose k.  Write.  Repeat.  The reverse pass is exactly the same but you 
> start from the end of file.
> 
> The tricky part is how to choose k.  It obviously needs to take number 
> of discarded lines, q, into account like in k = m * m/(N-q).  But my 
> math-fu is almost non-existent so I'm unable to prove if this approach 
> can give an uniform distribution.  I feel like it can though.

Feeling doesn't qualify.

Andrei



More information about the Digitalmars-d mailing list