random k-sample of a file
Kirk McDonald
kirklin.mcdonald at gmail.com
Thu Oct 9 13:20:49 PDT 2008
Andrei Alexandrescu wrote:
> I just ran across a nice little problem that I wanted to share as an
> exercise for the interested in futile pastimes.
>
> The challenge, should you accept it, is to write a program that given a
> number k and a file, outputs k lines picked uniformly at random from the
> file. (If the file has less than k lines, it should all be output.) The
> file's size is unbounded so loading it in memory is not an option. How'd
> you go about it?
>
>
> Andrei
This is a classic interview question. The solution for k == 1 is easy:
from random import randint
chosen_line = None
for i, line in enumerate(open('filename')):
if randint(0, i) == 0:
chosen_line = line
print chosen_line
(It is worth noting that randint() operates over an inclusive range.)
If you do the math, this works out to a uniform distribution. For
instance, say the file has three lines. Once we read in the third line,
there is a 1 out of 3 chance that it will be picked as the chosen_line.
Of the remaining 2 out of 3 chances, there is a 50% chance the second
line will be chosen, and a 50% chance of the first line.
Doing this for k > 1 becomes more complicated. We start by reading in
the first k lines of the file. (And if we run out of lines before that,
we're done.)
import itertools
from random import randint
f = open('filename')
k_lines = list(itertools.islice(f, k))
# Next we just iterate over the rest of the file. If we have exhausted
# the file, then the loop terminates immediately.
for i, line in enumerate(f, start=k):
if randint(0, i) == 0:
k_lines[randint(0, k-1)] = line
for line in k_lines:
print line
This is my first crack at a solution. I am not sure how close to a
uniform distribution this works out to.
--
Kirk McDonald
More information about the Digitalmars-d
mailing list