random k-sample of a file
Christopher Wright
dhasenan at gmail.com
Thu Oct 9 17:33:44 PDT 2008
Andrei Alexandrescu wrote:
> I just ran across a nice little problem that I wanted to share as an
> exercise for the interested in futile pastimes.
>
> The challenge, should you accept it, is to write a program that given a
> number k and a file, outputs k lines picked uniformly at random from the
> file. (If the file has less than k lines, it should all be output.) The
> file's size is unbounded so loading it in memory is not an option. How'd
> you go about it?
>
>
> Andrei
You can't do a uniform random distribution without knowing the length.
Your options are:
- Hold the entire file in memory.
- Hold the entire file so far in memory. Once you get over a certain
threshold, remove lines randomly.
You have to tweak the random distribution for the axing method to make
sure it's uniform. If this is going to be used often on huge data sets,
that is a reasonable effort.
How would you need to tweak the random distribution? The easiest way is
to get two random numbers: what section to remove lines from and which
line in that section to remove stuff from. You can work out some
annoying recurrence to find out how to get the probability of axing from
one section given that it has been available for k rounds.
I might work out the recurrence tonight, but I don't have a chalkboard.
More information about the Digitalmars-d
mailing list