random k-sample of a file

Ary Borenszweig ary at esperanto.org.ar
Thu Oct 9 12:16:38 PDT 2008


Walter Bright wrote:
> Andrei Alexandrescu wrote:
>> The challenge, should you accept it, is to write a program that given 
>> a number k and a file, outputs k lines picked uniformly at random from 
>> the file. (If the file has less than k lines, it should all be 
>> output.) The file's size is unbounded so loading it in memory is not 
>> an option. How'd you go about it?
> 
> 1. read the file, counting lines => nlines
> 2. select k random numbers 0..nlines
> 3. read the file line by line again, outputting the line if it's one of 
> the selected lines

Can't this be done in one single pass?

My thoughts were:

1. read the first k lines and assign them to the ones to return (the 
"stored lines")
2. probability of replacing a line: p = ??
3. for each new line, with probability p, replace a stored line (I don't 
know which), and decrease probability (p -= ??)

If you don't decrease the value of p in each iterations, the probability 
of keeping a line in the first group tends to zero.

I don't know the values of ??... maybe someone else can find them?

(I don't know how to randomly choose k elements of n, n >= k).



More information about the Digitalmars-d mailing list