random k-sample of a file

Kirk McDonald kirklin.mcdonald at gmail.com
Thu Oct 9 13:41:36 PDT 2008


Andrei Alexandrescu wrote:
> Kirk McDonald wrote:
>> Andrei Alexandrescu wrote:
>>> I just ran across a nice little problem that I wanted to share as an 
>>> exercise for those interested in futile pastimes.
>>>
>>> The challenge, should you accept it, is to write a program that given 
>>> a number k and a file, outputs k lines picked uniformly at random 
>>> from the file. (If the file has fewer than k lines, it should all be 
>>> output.) The file's size is unbounded so loading it in memory is not 
>>> an option. How'd you go about it?
>>>
>>>
>>> Andrei
>>
>> This is a classic interview question. The solution for k == 1 is easy:
>>
>> from random import randint
>>
>> chosen_line = None
>>
>> for i, line in enumerate(open('filename')):
>>     if randint(0, i) == 0:
>>         chosen_line = line
>>
>> print chosen_line
>>
>> (It is worth noting that randint() operates over an inclusive range.)
>>
>> If you do the math, this works out to a uniform distribution. For 
>> instance, say the file has three lines. Once we read in the third 
>> line, there is a 1 out of 3 chance that it will be picked as the 
>> chosen_line. Of the remaining 2 out of 3 chances, there is a 50% 
>> chance the second line will be chosen, and a 50% chance of the first 
>> line.
>>
>> Doing this for k > 1 becomes more complicated. We start by reading in 
>> the first k lines of the file. (And if we run out of lines before 
>> that, we're done.)
>>
>> import itertools
>> from random import randint
>>
>> f = open('filename')
>>
>> k_lines = list(itertools.islice(f, k))
>>
>> # Next we just iterate over the rest of the file. If we have exhausted
>> # the file, then the loop terminates immediately.
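>> for i, line in enumerate(f, start=k):
>>     # Keep line i with probability k/(i+1); when kept, it replaces
>>     # the slot it drew, which is uniform over the k current picks.
>>     j = randint(0, i)
>>     if j < k:
>>         k_lines[j] = line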
>>
>> for line in k_lines:
>>     print line
>>
>> This is my first crack at a solution. I am not sure how close to a 
>> uniform distribution this works out to.
> 
> This looks good too, but I'm not sure where i will start from in the 
> loop. It should start at k + 1.
> 
> But where's the D code you guys? D is better than Python at scripting. 
> Ahem.
> 
> 
> Andrei

No, i should start at k (which it does, that is what the start=k 
parameter to enumerate() is for). This is because the range passed to 
randint() starts at 0. The islice function (like slices in D and indeed 
regular slices in Python) uses an [inclusive, exclusive) range. Thus, 
itertools.islice(f, k) will grab the first k lines from the file. (The 
opening index of 0 is implied.)
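
As a small illustration of the half-open range described above, the following sketch compares islice with an ordinary slice. The list items and the value of k are made up for the example and do not come from the original post.

```python
from itertools import islice

# Hypothetical sample data; any iterable would do.
items = ["a", "b", "c", "d", "e"]
k = 3

# islice(items, k) yields indices 0 .. k-1, the half-open range [0, k).
first_k = list(islice(items, k))

# An ordinary slice with an explicit start of 0 selects the same elements.
same_as_slice = items[0:k]
```

Both first_k and same_as_slice hold exactly k elements, and the element at index k is not among them.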

islice is a function for "slicing" arbitrary iterable objects, 
particularly ones that can't normally be sliced. By slicing the first k 
lines from the file in this way, we also consume the first k lines from 
the file object, and so the for loop starts on line k, which is why I 
pass start=k to enumerate(). (Remember, I am counting lines from 0.)
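
The pieces discussed above (islice consuming the first k lines, enumerate() continuing from k, and the inclusive randint() range) can be put together in one self-contained sketch. This is not the code from the thread but a version of standard reservoir sampling (often called Algorithm R), written for Python 3, in which line i is kept with probability k/(i+1). The function name sample_lines, the rng parameter, and the in-memory test file are assumptions made for the example.

```python
import io
import random
from itertools import islice


def sample_lines(f, k, rng=random):
    """Return k lines picked uniformly at random from f.

    If f has fewer than k lines, all of them are returned.
    (sample_lines and rng are hypothetical names for this sketch.)
    """
    f = iter(f)                      # make sure islice really consumes f
    reservoir = list(islice(f, k))   # lines 0 .. k-1

    # The iterator now sits on line k, so start=k keeps i equal to the
    # 0-based line number.
    for i, line in enumerate(f, start=k):
        j = rng.randint(0, i)        # inclusive on both ends: i+1 outcomes
        if j < k:                    # true for k of those i+1 outcomes
            reservoir[j] = line
    return reservoir


# Usage with an in-memory stand-in for a file of 100 lines.
text = io.StringIO("".join("line %d\n" % n for n in range(100)))
picked = sample_lines(text, 5)
```

Only the k retained lines are held in memory at any time, so the size of the input does not matter. Reusing j as the slot to overwrite avoids a second call to randint(), because j is uniform over 0 .. k-1 whenever it is less than k.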

-- 
Kirk McDonald


