[OT] shuffling lines in a stream

Fri Oct 10 17:03:52 PDT 2008

Fri, 10 Oct 2008 17:45:28 -0500,
Andrei Alexandrescu wrote:
> Sean Kelly wrote:
> > Andrei Alexandrescu wrote:
> >> Sergey Gromov wrote:
> >>> Fri, 10 Oct 2008 15:54:18 -0500,
> >>> Andrei Alexandrescu wrote:
> >>>> Sean Kelly wrote:
> >>>>> It would be slower than the seeking option, but something like a 
> >>>>> randomized mergesort would work as well.  If the program can buffer 
> >>>>> k lines in a file containing n lines, then read the first k lines 
> >>>>> into memory, shuffle them, and write them out to a temporary file.  
> >>>>> Repeat until the input is exhausted.  Now randomly pick two of the 
> >>>>> temporary files and randomly merge them.  Repeat until two 
> >>>>> temporary files remain, then output the result of the final random 
> >>>>> merge to the screen.
> >>>>>
> >>>>> For small files (ie. where n<k) the file would be read, shuffled in 
> >>>>> memory, and printed to the screen, assuming the proper checks were 
> >>>>> in place.
> >>>> I think I found a problem with your solution. Consider you break the 
> >>>> file in three chunks: a, b, c. Then you pick at random b and c and 
> >>>> randomly merge them. The problem is, you make early the decision 
> >>>> that nothing in a will appear in the first two thirds of the result. 
> >>>> So the quality of randomness suffers. How would you address that?
> >>>
> >>> After merging b and c you end up with a and bc.  Then you randomly 
> >>> merge these two files and lines from a have all the chances to appear 
> >>> anywhere within result.
> >>>
> >>> When randomly merging, the probability of picking a line from a file 
> >>> should be proportional to a number of lines left in that file.
> >>
> >> How do you "randomly merge"? Describe the process.
> > 
> > 
> > while( a.hasMore || b.hasMore )
> > {
> >     if( rand % 2 )
> >         r ~= a.readLine;
> >     else
> >         r ~= b.readLine;
> > }
> > while( a.hasMore )
> >     r ~= a.readLine;
> > while( b.hasMore )
> >     r ~= b.readLine;
> > 
> > The downside to randomly choosing two files is that it's possible you 
> > could end up always choosing a small and a large file, thus increasing 
> > the chance that the large file will have a bunch of lines left over 
> > which are essentially appended.
> 
> Yah, Sergey got me thinking too. What if you have a large file that 
> doesn't fit in memory by exactly one line?
> 
> Then I shuffle (n - 1) lines at random, and there remains the last line 
> to write. If I choose from either fragment with probability 0.5, the 
> last line will be with high probability among the first in the result!
> 
> > The easy way around this would be to maintain a list of files so only 
> > equal-sized files are merged (similar to how mergesort works).  However, 
> > I think a better way would be to use some sort of bias for the random 
> > check above so that different-sized files are on equal footing.  Perhaps 
> > write out the file with the first 4 bytes containing the number of lines 
> > in that file and use that to construct the bias.  I'll leave those with 
> > better math-fu than I to come up with a formula which would be fair.
> 
> (In similar situations I've put the number of lines right in the name of 
> the temporary file. I kid you not! And it's O(1) too!)
> 
> It looks like any fragment should be chosen from with a probability 
> proportional to its length. The lingering question for me is whether 
> that probability needs to be updated, or can be fixed at the beginning 
> of merging. I think fixed doesn't quite work unless special measures are 
> taken at whenever a range is done.
> 
> Again many thanks to all participants for your illuminating insights.

Here's my little proof.  Paste this into wikipedia to see formulas.

Probability of a line from <math>a</math> getting into the first line of 
result:

<math>
p_a^1=\frac{n_a}{n_a+n_b}
</math>

Likewise for <math>b</math>:

<math>
p_b^1=\frac{n_b}{n_a+n_b}
</math>

Probability of a line from <math>a</math> becoming a second line of the 
output, <math>p_a^2</math>, is:

<math>
p_a^2=p_a^2\Big|_{a_1} + p_a^2\Big|_{b_1}
</math>

where <math>p_a^2\Big|_{a_1}</math> and <math>p_a^2\Big|_{b_1}</math> 
are probabilities of a line from <math>a</math> getting into position 
<math>2</math> when the first line is from <math>a</math> and from 
<math>b</math> respectively.

<math>
p_a^2\Big|_{a_1} = \frac{n_a}{n_a+n_b}\cdot\frac{n_a-1}{n_a-1+n_b} = 
\frac{n_a(n_a-1)}{(n_a+n_b)(n_a+n_b-1)}
</math>

<math>
p_a^2\Big|_{b_1} = \frac{n_b}{n_a+n_b}\cdot\frac{n_a}{n_a+n_b-1} = \frac
{n_a n_b}{(n_a+n_b)(n_a+n_b-1)}
</math>

Therefore

<math>
p_a^2 = \frac{n_a(n_a-1)}{(n_a+n_b)(n_a+n_b-1)}+\frac{n_a n_b}{(n_a+n_b)
(n_a+n_b-1)} = \frac{n_a(n_a+n_b-1)}{(n_a+n_b)(n_a+n_b-1)} = \frac{n_a}
{n_a+n_b}
</math>

So at least <math>p_a^1=p_a^2</math>.  Induce anyone?