[OT] shuffling lines in a stream
Andrei Alexandrescu
SeeWebsiteForEmail at erdani.org
Fri Oct 10 15:45:28 PDT 2008
Sean Kelly wrote:
> Andrei Alexandrescu wrote:
>> Sergey Gromov wrote:
>>> Fri, 10 Oct 2008 15:54:18 -0500,
>>> Andrei Alexandrescu wrote:
>>>> Sean Kelly wrote:
>>>>> It would be slower than the seeking option, but something like a
>>>>> randomized mergesort would work as well. If the program can buffer
>>>>> k lines of a file containing n lines, then read the first k lines
>>>>> into memory, shuffle them, and write them out to a temporary file.
>>>>> Repeat until the input is exhausted. Now randomly pick two of the
>>>>> temporary files and randomly merge them. Repeat until two
>>>>> temporary files remain, then output the result of the final random
>>>>> merge to the screen.
>>>>>
>>>>> For small files (i.e. where n < k) the file would be read, shuffled in
>>>>> memory, and printed to the screen, assuming the proper checks were
>>>>> in place.
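
For concreteness, a first pass along those lines might look like the sketch
below, in D. The helper's name and the chunk-naming scheme are made up for
the example, and it assumes each k-line buffer fits comfortably in memory.

import std.format : format;
import std.random : randomShuffle;
import std.stdio : File;

// Buffer up to k lines at a time, shuffle each buffer in memory, and
// flush it to its own temporary file.  Returns the temporary file names.
string[] writeShuffledChunks(File input, size_t k)
{
    string[] names;
    string[] buf;

    void flush()
    {
        if (buf.length == 0) return;
        buf.randomShuffle();
        auto name = format("chunk-%s.tmp", names.length);
        auto f = File(name, "w");
        foreach (line; buf) f.writeln(line);
        f.close();
        names ~= name;
        buf.length = 0;
    }

    foreach (line; input.byLine())
    {
        buf ~= line.idup;          // byLine reuses its buffer, so copy
        if (buf.length == k) flush();
    }
    flush();                       // leftover lines, if any
    return names;
}
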
>>>> I think I found a problem with your solution. Suppose you break the
>>>> file into three chunks: a, b, c. Then you pick b and c at random and
>>>> randomly merge them. The problem is, you make the decision early that
>>>> nothing from a will appear in the first two thirds of the result.
>>>> So the quality of the randomness suffers. How would you address that?
>>>
>>> After merging b and c you end up with a and bc. Then you randomly
>>> merge these two files, and lines from a have every chance to appear
>>> anywhere within the result.
>>>
>>> When randomly merging, the probability of picking a line from a file
>>> should be proportional to the number of lines left in that file.
>>
>> How do you "randomly merge"? Describe the process.
>
>
> while( a.hasMore && b.hasMore )
> {
>     if( rand % 2 )
>         r ~= a.readLine;
>     else
>         r ~= b.readLine;
> }
> while( a.hasMore )
>     r ~= a.readLine;
> while( b.hasMore )
>     r ~= b.readLine;
>
> The downside to randomly choosing two files is that it's possible you
> could end up always choosing a small and a large file, thus increasing
> the chance that the large file will have a bunch of lines left over
> which are essentially appended.
Yah, Sergey got me thinking too. What if you have a large file that
doesn't fit in memory by exactly one line?
Then I shuffle (n - 1) lines at random, and the last line remains to be
written. If I choose from either fragment with probability 0.5, that last
line will, with high probability, land among the first lines of the result!
(With a fair coin it comes out at position i with probability 2^-i, so it
almost always shows up within the first few lines.)
> The easy way around this would be to maintain a list of files so only
> equal-sized files are merged (similar to how mergesort works). However,
> I think a better way would be to use some sort of bias for the random
> check above so that different-sized files are on equal footing. Perhaps
> write out the file with the first 4 bytes containing the number of lines
> in that file and use that to construct the bias. I'll leave it to those
> with better math-fu than I to come up with a formula which would be fair.
(In similar situations I've put the number of lines right in the name of
the temporary file. I kid you not! And it's O(1) too!)
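
Something like that could be as simple as the following; the exact naming
scheme below is of course made up for the example.

import std.array : split;
import std.conv : to;
import std.format : format;
import std.path : stripExtension;

// Hypothetical scheme: embed the line count in the temporary file's name,
// e.g. "chunk-3-25000.tmp", and parse it back out when setting up a merge.
string chunkName(size_t index, size_t lineCount)
{
    return format("chunk-%s-%s.tmp", index, lineCount);
}

size_t linesIn(string name)
{
    return name.stripExtension.split("-")[2].to!size_t;
}

No seek into the file and no header to parse; the count travels with the
name.
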
It looks like each fragment should be chosen from with a probability
proportional to its length. The lingering question for me is whether
that probability needs to be updated as lines are consumed, or can be
fixed at the beginning of the merge. I think fixed doesn't quite work
unless special measures are taken whenever a range is exhausted.
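
Here is one way to do that merge, sketched in D, with the probability
recomputed from the remaining line counts at every step. The function name
and signature are made up; the counts would come from wherever they are
stored, e.g. the temporary file names.

import std.random : uniform;
import std.stdio : File;

// Merge two shuffled fragments: at each step, draw the next line from a
// fragment with probability proportional to how many lines it has left.
// remA and remB are the fragments' remaining line counts.
void randomMerge(File a, size_t remA, File b, size_t remB, File dest)
{
    auto linesA = a.byLine();
    auto linesB = b.byLine();
    while (remA + remB > 0)
    {
        // True with probability remA / (remA + remB), re-evaluated on
        // every iteration; an exhausted fragment is never picked again.
        if (uniform(0, remA + remB) < remA)
        {
            dest.writeln(linesA.front);
            linesA.popFront();
            --remA;
        }
        else
        {
            dest.writeln(linesB.front);
            linesB.popFront();
            --remB;
        }
    }
}

With the counts updated like that, every interleaving of the two fragments
is equally likely, and nothing special has to happen when one fragment runs
dry: its probability of being picked simply drops to zero.
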
Again many thanks to all participants for your illuminating insights.
Andrei