random k-sample of a file

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Thu Oct 9 13:34:14 PDT 2008


Kirk McDonald wrote:
> Andrei Alexandrescu wrote:
>> I just ran across a nice little problem that I wanted to share as an 
>> exercise for the interested in futile pastimes.
>>
>> The challenge, should you accept it, is to write a program that given 
>> a number k and a file, outputs k lines picked uniformly at random from 
>> the file. (If the file has less than k lines, it should all be 
>> output.) The file's size is unbounded so loading it in memory is not 
>> an option. How'd you go about it?
>>
>>
>> Andrei
> 
> This is a classic interview question. The solution for k == 1 is easy:
> 
> from random import randint
> 
> chosen_line = None
> 
> for i, line in enumerate(open('filename')):
>     if randint(0, i) == 0:
>         chosen_line = line
> 
> print chosen_line
> 
> (It is worth noting that randint() operates over an inclusive range.)
> 
> If you do the math, this works out to a uniform distribution. For 
> instance, say the file has three lines. Once we read in the third line, 
> there is a 1 out of 3 chance that it will be picked as the chosen_line. 
> Of the remaining 2 out of 3 chances, there is a 50% chance the second 
> line will be chosen, and a 50% chance of the first line.
> 
> Doing this for k > 1 becomes more complicated. We start by reading in 
> the first k lines of the file. (And if we run out of lines before that, 
> we're done.)
> 
> import itertools
> from random import randint
> 
> f = open('filename')
> 
> k_lines = list(itertools.islice(f, k))
> 
> # Next we just iterate over the rest of the file. If we have exhausted
> # the file, then the loop terminates immediately.
> for i, line in enumerate(f, start=k):
>     if randint(0, i) == 0:
>         k_lines[randint(0, k-1)] = line
> 
> for line in k_lines:
>     print line
> 
> This is my first crack at a solution. I am not sure how close to a 
> uniform distribution this works out to.
> 

Oh forgot to mention this is superior numerically to bearophile's
because it's not affected by floating point quantization vagaries. But 
it came later too...

Andrei




More information about the Digitalmars-d mailing list