[OT] shuffling lines in a stream
Sergey Gromov
snake.scaly at gmail.com
Fri Oct 10 17:03:52 PDT 2008
Fri, 10 Oct 2008 17:45:28 -0500,
Andrei Alexandrescu wrote:
> Sean Kelly wrote:
> > Andrei Alexandrescu wrote:
> >> Sergey Gromov wrote:
> >>> Fri, 10 Oct 2008 15:54:18 -0500,
> >>> Andrei Alexandrescu wrote:
> >>>> Sean Kelly wrote:
> >>>>> It would be slower than the seeking option, but something like a
> >>>>> randomized mergesort would work as well. If the program can buffer
> >>>>> k lines in a file containing n lines, then read the first k lines
> >>>>> into memory, shuffle them, and write them out to a temporary file.
> >>>>> Repeat until the input is exhausted. Now randomly pick two of the
> >>>>> temporary files and randomly merge them. Repeat until two
> >>>>> temporary files remain, then output the result of the final random
> >>>>> merge to the screen.
> >>>>>
> >>>>> For small files (ie. where n<k) the file would be read, shuffled in
> >>>>> memory, and printed to the screen, assuming the proper checks were
> >>>>> in place.
> >>>> I think I found a problem with your solution. Consider you break the
> >>>> file in three chunks: a, b, c. Then you pick at random b and c and
> >>>> randomly merge them. The problem is, you make early the decision
> >>>> that nothing in a will appear in the first two thirds of the result.
> >>>> So the quality of randomness suffers. How would you address that?
> >>>
> >>> After merging b and c you end up with a and bc. Then you randomly
> >>> merge these two files and lines from a have all the chances to appear
> >>> anywhere within result.
> >>>
> >>> When randomly merging, the probability of picking a line from a file
> >>> should be proportional to a number of lines left in that file.
> >>
> >> How do you "randomly merge"? Describe the process.
> >
> >
> > while( a.hasMore || b.hasMore )
> > {
> > if( rand % 2 )
> > r ~= a.readLine;
> > else
> > r ~= b.readLine;
> > }
> > while( a.hasMore )
> > r ~= a.readLine;
> > while( b.hasMore )
> > r ~= b.readLine;
> >
> > The downside to randomly choosing two files is that it's possible you
> > could end up always choosing a small and a large file, thus increasing
> > the chance that the large file will have a bunch of lines left over
> > which are essentially appended.
>
> Yah, Sergey got me thinking too. What if you have a large file that
> doesn't fit in memory by exactly one line?
>
> Then I shuffle (n - 1) lines at random, and there remains the last line
> to write. If I choose from either fragment with probability 0.5, the
> last line will be with high probability among the first in the result!
>
> > The easy way around this would be to maintain a list of files so only
> > equal-sized files are merged (similar to how mergesort works). However,
> > I think a better way would be to use some sort of bias for the random
> > check above so that different-sized files are on equal footing. Perhaps
> > write out the file with the first 4 bytes containing the number of lines
> > in that file and use that to construct the bias. I'll leave those with
> > better math-fu than I to come up with a formula which would be fair.
>
> (In similar situations I've put the number of lines right in the name of
> the temporary file. I kid you not! And it's O(1) too!)
>
> It looks like any fragment should be chosen from with a probability
> proportional to its length. The lingering question for me is whether
> that probability needs to be updated, or can be fixed at the beginning
> of merging. I think fixed doesn't quite work unless special measures are
> taken at whenever a range is done.
>
> Again many thanks to all participants for your illuminating insights.
Here's my little proof. Paste this into wikipedia to see formulas.
Probability of a line from <math>a</math> getting into the first line of
result:
<math>
p_a^1=\frac{n_a}{n_a+n_b}
</math>
Likewise for <math>b</math>:
<math>
p_b^1=\frac{n_b}{n_a+n_b}
</math>
Probability of a line from <math>a</math> becoming a second line of the
output, <math>p_a^2</math>, is:
<math>
p_a^2=p_a^2\Big|_{a_1} + p_a^2\Big|_{b_1}
</math>
where <math>p_a^2\Big|_{a_1}</math> and <math>p_a^2\Big|_{b_1}</math>
are probabilities of a line from <math>a</math> getting into position
<math>2</math> when the first line is from <math>a</math> and from
<math>b</math> respectively.
<math>
p_a^2\Big|_{a_1} = \frac{n_a}{n_a+n_b}\cdot\frac{n_a-1}{n_a-1+n_b} =
\frac{n_a(n_a-1)}{(n_a+n_b)(n_a+n_b-1)}
</math>
<math>
p_a^2\Big|_{b_1} = \frac{n_b}{n_a+n_b}\cdot\frac{n_a}{n_a+n_b-1} = \frac
{n_a n_b}{(n_a+n_b)(n_a+n_b-1)}
</math>
Therefore
<math>
p_a^2 = \frac{n_a(n_a-1)}{(n_a+n_b)(n_a+n_b-1)}+\frac{n_a n_b}{(n_a+n_b)
(n_a+n_b-1)} = \frac{n_a(n_a+n_b-1)}{(n_a+n_b)(n_a+n_b-1)} = \frac{n_a}
{n_a+n_b}
</math>
So at least <math>p_a^1=p_a^2</math>. Induce anyone?
More information about the Digitalmars-d
mailing list