[OT] shuffling lines in a stream

Sergey Gromov snake.scaly at gmail.com
Fri Oct 10 14:08:27 PDT 2008


Fri, 10 Oct 2008 15:54:18 -0500,
Andrei Alexandrescu wrote:
> Sean Kelly wrote:
> > It would be slower than the seeking option, but something like a 
> > randomized mergesort would work as well.  If the program can buffer k 
> > lines in a file containing n lines, then read the first k lines into 
> > memory, shuffle them, and write them out to a temporary file.  Repeat 
> > until the input is exhausted.  Now randomly pick two of the temporary 
> > files and randomly merge them.  Repeat until two temporary files remain, 
> > then output the result of the final random merge to the screen.
> > 
> > For small files (ie. where n<k) the file would be read, shuffled in 
> > memory, and printed to the screen, assuming the proper checks were in 
> > place.
> 
> I think I found a problem with your solution. Consider you break the 
> file in three chunks: a, b, c. Then you pick at random b and c and 
> randomly merge them. The problem is, you make early the decision that 
> nothing in a will appear in the first two thirds of the result. So the 
> quality of randomness suffers. How would you address that?

After merging b and c you end up with a and bc.  Then you randomly merge 
these two files and lines from a have all the chances to appear anywhere 
within result.

When randomly merging, the probability of picking a line from a file 
should be proportional to a number of lines left in that file.



More information about the Digitalmars-d mailing list