random k-sample of a file

Thu Oct 9 13:55:57 PDT 2008

Kirk McDonald wrote:
> Andrei Alexandrescu wrote:
>> Kirk McDonald wrote:
>>> Andrei Alexandrescu wrote:
>>>> I just ran across a nice little problem that I wanted to share as an 
>>>> exercise for the interested in futile pastimes.
>>>>
>>>> The challenge, should you accept it, is to write a program that 
>>>> given a number k and a file, outputs k lines picked uniformly at 
>>>> random from the file. (If the file has less than k lines, it should 
>>>> all be output.) The file's size is unbounded so loading it in memory 
>>>> is not an option. How'd you go about it?
>>>>
>>>>
>>>> Andrei
>>>
>>> This is a classic interview question. The solution for k == 1 is easy:
>>>
>>> from random import randint
>>>
>>> chosen_line = None
>>>
>>> for i, line in enumerate(open('filename')):
>>>     if randint(0, i) == 0:
>>>         chosen_line = line
>>>
>>> print chosen_line
>>>
>>> (It is worth noting that randint() operates over an inclusive range.)
>>>
>>> If you do the math, this works out to a uniform distribution. For 
>>> instance, say the file has three lines. Once we read in the third 
>>> line, there is a 1 out of 3 chance that it will be picked as the 
>>> chosen_line. Of the remaining 2 out of 3 chances, there is a 50% 
>>> chance the second line will be chosen, and a 50% chance of the first 
>>> line.
>>>
>>> Doing this for k > 1 becomes more complicated. We start by reading in 
>>> the first k lines of the file. (And if we run out of lines before 
>>> that, we're done.)
>>>
>>> import itertools
>>> from random import randint
>>>
>>> f = open('filename')
>>>
>>> k_lines = list(itertools.islice(f, k))
>>>
>>> # Next we just iterate over the rest of the file. If we have exhausted
>>> # the file, then the loop terminates immediately.
>>> for i, line in enumerate(f, start=k):
>>>     if randint(0, i) == 0:
>>>         k_lines[randint(0, k-1)] = line
>>>
>>> for line in k_lines:
>>>     print line
>>>
>>> This is my first crack at a solution. I am not sure how close to a 
>>> uniform distribution this works out to.
>>
>> This looks good too, but I'm not sure where i will start from in the 
>> loop. It should start at k + 1.
>>
>> But where's the D code you guys? D is better than Python at scripting. 
>> Ahem.
>>
>>
>> Andrei
> 
> No, i should start at k (which it does, that is what the start=k 
> parameter to enumerate() is for). This is because the range passed to 
> randint() starts at 0. The islice function (like slices in D and indeed 
> regular slices in Python) uses an [inclusive, exclusive) range. Thus, 
> itertools.islice(f, k) will grab the first k lines from the file. (The 
> opening index of 0 is implied.)
> 
> islice is a function for "slicing" arbitrary iterable objects, 
> particularly ones that can't normally be sliced. By slicing the first k 
> lines from the file in this way, we also consume the first k lines from 
> the file object, and so the for loop starts on line k, which is why I 
> pass start=k to enumerate(). (Remember, I am counting lines from 0.)
> 

Sounds good.

Andrei