[OT] shuffling lines in a stream

Sergey Gromov snake.scaly at gmail.com
Fri Oct 10 14:50:28 PDT 2008


Fri, 10 Oct 2008 16:27:03 -0500,
Andrei Alexandrescu wrote:
> Sergey Gromov wrote:
> > Fri, 10 Oct 2008 16:10:29 -0500,
> > Andrei Alexandrescu wrote:
> >> Sergey Gromov wrote:
> >>> Fri, 10 Oct 2008 15:54:18 -0500,
> >>> Andrei Alexandrescu wrote:
> >>>> Sean Kelly wrote:
> >>>>> It would be slower than the seeking option, but something like a 
> >>>>> randomized mergesort would work as well.  If the program can buffer k 
> >>>>> lines in a file containing n lines, then read the first k lines into 
> >>>>> memory, shuffle them, and write them out to a temporary file.  Repeat 
> >>>>> until the input is exhausted.  Now randomly pick two of the temporary 
> >>>>> files and randomly merge them.  Repeat until two temporary files remain, 
> >>>>> then output the result of the final random merge to the screen.
> >>>>>
> >>>>> For small files (ie. where n<k) the file would be read, shuffled in 
> >>>>> memory, and printed to the screen, assuming the proper checks were in 
> >>>>> place.
> >>>> I think I found a problem with your solution. Consider you break the 
> >>>> file in three chunks: a, b, c. Then you pick at random b and c and 
> >>>> randomly merge them. The problem is, you make early the decision that 
> >>>> nothing in a will appear in the first two thirds of the result. So the 
> >>>> quality of randomness suffers. How would you address that?
> >>> After merging b and c you end up with a and bc.  Then you randomly merge 
> >>> these two files and lines from a have all the chances to appear anywhere 
> >>> within result.
> >>>
> >>> When randomly merging, the probability of picking a line from a file 
> >>> should be proportional to a number of lines left in that file.
> >> How do you "randomly merge"? Describe the process.
> > 
> > a[] and b[] are files to merge.  ab[] is the result.  a.length and 
> > b.length are number of lines left in either file.
> > 
> > while (a.length || b.length)
> > {
> >   if (uniform(gen, 0, a.length + b.length) < a.length)
> >   {
> >     ab.put(a.head);
> >     a.next();
> >   }
> >   else
> >   {
> >     ab.put(b.head);
> >     b.next();
> >   }
> > }
> 
> This should work.

I just rigorously proved that probability of line from a getting into 
output position 0 is equal to probability of line from a getting into 
position 1 and is equal to a.length/(a.length+b.length).  I need to 
recollect TeX to write this proof down for electronic use.

> Use of ranges noted :o).

Except that a.length and b.length should be stored in files somehow.



More information about the Digitalmars-d mailing list