random k-sample of a file

KennyTM~ kennytm at gmail.com
Thu Oct 9 12:21:31 PDT 2008


Ary Borenszweig wrote:
> Walter Bright wrote:
>> Andrei Alexandrescu wrote:
>>> The challenge, should you accept it, is to write a program that given 
>>> a number k and a file, outputs k lines picked uniformly at random 
>>> from the file. (If the file has less than k lines, it should all be 
>>> output.) The file's size is unbounded so loading it in memory is not 
>>> an option. How'd you go about it?
>>
>> 1. read the file, counting lines => nlines
>> 2. select k random numbers 0..nlines
>> 3. read the file line by line again, outputting the line if it's one 
>> of the selected lines
> 
> Can't this be done in one single pass?
> 
> My thoughts were:
> 
> 1. read the first k lines and assign them to the ones to return (the 
> "stored lines")
> 2. probability of replacing a line: p = ??
> 3. for each new line, with probability p, replace a stored line (I don't 
> know which), and decrease probability (p -= ??)
> 
> If you don't decrease the value of p in each iterations, the probability 
> of keeping a line in the first group tends to zero.
> 
> I don't know the values of ??... maybe someone else can find them?
> 
> (I don't know how to randomly choose k elements of n, n >= k).

I guess p depends on the number of lines unfortunately.
Because p needs to be decreasing with line count, otherwise the first 
few line will almost always be replaced (not uniform then).



More information about the Digitalmars-d mailing list