[OT] shuffling lines in a stream
Sean Kelly
sean at invisibleduck.org
Fri Oct 10 15:37:49 PDT 2008
Andrei Alexandrescu wrote:
> Sergey Gromov wrote:
>> Fri, 10 Oct 2008 15:54:18 -0500,
>> Andrei Alexandrescu wrote:
>>> Sean Kelly wrote:
>>>> It would be slower than the seeking option, but something like a
>>>> randomized mergesort would work as well. If the program can buffer
>>>> k lines in a file containing n lines, then read the first k lines
>>>> into memory, shuffle them, and write them out to a temporary file.
>>>> Repeat until the input is exhausted. Now randomly pick two of the
>>>> temporary files and randomly merge them. Repeat until two temporary
>>>> files remain, then output the result of the final random merge to
>>>> the screen.
>>>>
>>>> For small files (i.e. where n < k) the file would be read, shuffled
>>>> in memory, and printed to the screen, assuming the proper checks
>>>> were in place.
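In code, the first pass described above might look something like this
(the chunk size k, the temp file naming, and the helper function are
illustrative, not fixed by the description):

import std.format : format;
import std.random : randomShuffle;
import std.stdio : File;

// First pass: buffer up to k lines, shuffle each block in memory, and
// write it out to its own temporary file.  Returns the temp file names.
string[] writeShuffledChunks(File input, size_t k)
{
    string[] names;
    string[] buf;
    foreach (line; input.byLineCopy)
    {
        buf ~= line;
        if (buf.length == k)
            flushChunk(buf, names);
    }
    if (buf.length)
        flushChunk(buf, names);   // final partial chunk
    return names;
}

void flushChunk(ref string[] buf, ref string[] names)
{
    randomShuffle(buf);                       // in-memory shuffle
    auto name = format("chunk%s.tmp", names.length);
    auto f = File(name, "w");
    foreach (line; buf)
        f.writeln(line);
    names ~= name;
    buf = null;                               // start a fresh buffer
}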
>>> I think I found a problem with your solution. Suppose you break the
>>> file into three chunks: a, b, c. Then you pick b and c at random and
>>> randomly merge them. The problem is, you decide early on that nothing
>>> in a will appear in the first two thirds of the result, so the
>>> quality of the randomness suffers. How would you address that?
>>
>> After merging b and c you end up with a and bc. Then you randomly
>> merge these two files, and lines from a have every chance to appear
>> anywhere within the result.
>>
>> When randomly merging, the probability of picking a line from a file
>> should be proportional to the number of lines left in that file.
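As a quick sanity check of that rule: merging a 2-line file a with a
1-line file b, the three possible orders aab, aba, and baa occur with
probabilities 2/3 * 1/2 = 1/3, 2/3 * 1/2 = 1/3, and 1/3 * 1 = 1/3, so
every interleaving of the two files is equally likely.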
>
> How do you "randomly merge"? Describe the process.
// flip a coin while both files still have lines
while( a.hasMore && b.hasMore )
{
    if( rand % 2 )
        r ~= a.readLine;
    else
        r ~= b.readLine;
}
// then drain whichever file still has lines left
while( a.hasMore )
    r ~= a.readLine;
while( b.hasMore )
    r ~= b.readLine;
The downside to randomly choosing two files is that you could repeatedly
end up pairing a small file with a large one, increasing the chance that
the large file will have a long run of leftover lines that are
essentially appended unshuffled.
The easy way around this would be to maintain a list of files so that
only equal-sized files are merged (similar to how mergesort works).
However, I think a better way would be to bias the random check above so
that different-sized files are on equal footing. Perhaps write out each
temporary file with the first 4 bytes containing the number of lines in
that file and use that count to construct the bias. I'll leave it to
those with better math-fu than mine to come up with a formula that would
be fair.
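That said, Sergey's rule above suggests the formula: with m lines left
in one file and n in the other, take the next line from the first with
probability m/(m+n). A sketch in D (the function name is illustrative,
and the counts m and n are assumed known up front, e.g. from the 4-byte
header):

import std.random : uniform;
import std.stdio : File;

// Merge a and b into r, biased by the remaining line counts: with m
// lines left in a and n left in b, the next line comes from a with
// probability m/(m+n).
void biasedMerge(File a, size_t m, File b, size_t n, File r)
{
    while (m > 0 && n > 0)
    {
        // uniform(0, m + n) yields an integer in [0, m+n); it is
        // less than m with probability exactly m/(m+n)
        if (uniform(0, m + n) < m)
        {
            r.write(a.readln());   // readln keeps the newline
            --m;
        }
        else
        {
            r.write(b.readln());
            --n;
        }
    }
    // at most one of these loops does anything: once one file is
    // empty, the rest of the other must come out in order anyway
    while (m-- > 0)
        r.write(a.readln());
    while (n-- > 0)
        r.write(b.readln());
}

With this bias every interleaving of the two files is equally likely, so
pairing a small file with a large one stops being a problem.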
Sean