random k-sample of a file

Christopher Wright dhasenan at gmail.com
Thu Oct 9 17:33:44 PDT 2008


Andrei Alexandrescu wrote:
> I just ran across a nice little problem that I wanted to share as an 
> exercise for the interested in futile pastimes.
> 
> The challenge, should you accept it, is to write a program that given a 
> number k and a file, outputs k lines picked uniformly at random from the 
> file. (If the file has less than k lines, it should all be output.) The 
> file's size is unbounded so loading it in memory is not an option. How'd 
> you go about it?
> 
> 
> Andrei

You can't do a uniform random distribution without knowing the length.

Your options are:
- Hold the entire file in memory.
- Hold the entire file so far in memory. Once you get over a certain 
threshold, remove lines randomly.

You have to tweak the random distribution for the axing method to make 
sure it's uniform. If this is going to be used often on huge data sets, 
that is a reasonable effort.

How would you need to tweak the random distribution? The easiest way is 
to get two random numbers: what section to remove lines from and which 
line in that section to remove stuff from. You can work out some 
annoying recurrence to find out how to get the probability of axing from 
one section given that it has been available for k rounds.

I might work out the recurrence tonight, but I don't have a chalkboard.



More information about the Digitalmars-d mailing list