random k-sample of a file

Kirk McDonald kirklin.mcdonald at gmail.com
Thu Oct 9 13:20:49 PDT 2008


Andrei Alexandrescu wrote:
> I just ran across a nice little problem that I wanted to share as an 
> exercise for the interested in futile pastimes.
> 
> The challenge, should you accept it, is to write a program that given a 
> number k and a file, outputs k lines picked uniformly at random from the 
> file. (If the file has less than k lines, it should all be output.) The 
> file's size is unbounded so loading it in memory is not an option. How'd 
> you go about it?
> 
> 
> Andrei

This is a classic interview question. The solution for k == 1 is easy:

from random import randint

chosen_line = None

for i, line in enumerate(open('filename')):
     if randint(0, i) == 0:
         chosen_line = line

print chosen_line

(It is worth noting that randint() operates over an inclusive range.)

If you do the math, this works out to a uniform distribution. For 
instance, say the file has three lines. Once we read in the third line, 
there is a 1 out of 3 chance that it will be picked as the chosen_line. 
Of the remaining 2 out of 3 chances, there is a 50% chance the second 
line will be chosen, and a 50% chance of the first line.

Doing this for k > 1 becomes more complicated. We start by reading in 
the first k lines of the file. (And if we run out of lines before that, 
we're done.)

import itertools
from random import randint

f = open('filename')

k_lines = list(itertools.islice(f, k))

# Next we just iterate over the rest of the file. If we have exhausted
# the file, then the loop terminates immediately.
for i, line in enumerate(f, start=k):
     if randint(0, i) == 0:
         k_lines[randint(0, k-1)] = line

for line in k_lines:
     print line

This is my first crack at a solution. I am not sure how close to a 
uniform distribution this works out to.

-- 
Kirk McDonald



More information about the Digitalmars-d mailing list