CORRECTION: random k-sample of a file
Russell Lewis
webmaster at villagersonline.com
Thu Oct 9 16:11:41 PDT 2008
Sorry, I overlooked the (obvious) fact that this biases things towards
the end of the file. Change the while loop as follows:
BEGIN CODE
int count = k;
while(!feof(stdin))
{
if(random(0,count) >= k)
ReadALine(); // this discards the line
else
{
int i = random(0,k);
bufs[i] = ReadALine();
}
count++;
}
END CODE
Note that I'm assuming that random(a,b) is a function that randomly
returns an integer in the range [a,b).
This new version works because the most-recently-read line has a k/count
chance of getting picked up into the current working set, and when we
do choose to discard some line, we give all of the old lines an equal
chance of survival.
I don't have to to formally work out the math right now, but I'm pretty
sure that that would give each line in the entire file a k/count chance
of being in the set at any given point in the process.
Russ
Russell Lewis wrote:
> BEGIN CODE
> char[][] choose(int k)
> {
> char[][] bufs; bufs.length = k;
>
> for(int i=0; i<k; i++)
> bufs[i] = ReadALine();
>
> while(!feof(stdin))
> {
> int i = random(0,k);
> bufs[i] = ReadALine();
> }
>
> return bufs;
> }
>
> END CODE
>
> Andrei Alexandrescu wrote:
>> I just ran across a nice little problem that I wanted to share as an
>> exercise for the interested in futile pastimes.
>>
>> The challenge, should you accept it, is to write a program that given
>> a number k and a file, outputs k lines picked uniformly at random from
>> the file. (If the file has less than k lines, it should all be
>> output.) The file's size is unbounded so loading it in memory is not
>> an option. How'd you go about it?
>>
>>
>> Andrei
More information about the Digitalmars-d
mailing list