CORRECTION: random k-sample of a file

Russell Lewis webmaster at villagersonline.com
Thu Oct 9 16:11:41 PDT 2008


Sorry, I overlooked the (obvious) fact that this biases things towards 
the end of the file.  Change the while loop as follows:

BEGIN CODE
   int count = k;
   while(!feof(stdin))
   {
     if(random(0,count) >= k)
       ReadALine();   // this discards the line
     else
     {
       int i = random(0,k);
       bufs[i] = ReadALine();
     }
     count++;
   }
END CODE

Note that I'm assuming that random(a,b) is a function that randomly 
returns an integer in the range [a,b).

This new version works because the most-recently-read line has a k/count 
  chance of getting picked up into the current working set, and when we 
do choose to discard some line, we give all of the old lines an equal 
chance of survival.

I don't have to to formally work out the math right now, but I'm pretty 
sure that that would give each line in the entire file a k/count chance 
of being in the set at any given point in the process.

Russ

Russell Lewis wrote:
> BEGIN CODE
> char[][] choose(int k)
> {
>   char[][] bufs; bufs.length = k;
> 
>   for(int i=0; i<k; i++)
>     bufs[i] = ReadALine();
> 
>   while(!feof(stdin))
>   {
>     int i = random(0,k);
>     bufs[i] = ReadALine();
>   }
> 
>   return bufs;
> }
> 
> END CODE
> 
> Andrei Alexandrescu wrote:
>> I just ran across a nice little problem that I wanted to share as an 
>> exercise for the interested in futile pastimes.
>>
>> The challenge, should you accept it, is to write a program that given 
>> a number k and a file, outputs k lines picked uniformly at random from 
>> the file. (If the file has less than k lines, it should all be 
>> output.) The file's size is unbounded so loading it in memory is not 
>> an option. How'd you go about it?
>>
>>
>> Andrei



More information about the Digitalmars-d mailing list