random k-sample of a file

Kirk McDonald kirklin.mcdonald at gmail.com
Thu Oct 9 15:29:05 PDT 2008


Ary Borenszweig wrote:
> Andrei Alexandrescu escribió:
>> Ary Borenszweig wrote:
>>> Andrei Alexandrescu escribió:
>>>> bearophile wrote:
>>>>> Third solution, this requires a storage of k lines (but you can 
>>>>> keep this storage on disk):
>>>>>
>>>>> from sys import argv
>>>>> from random import random, randrange
>>>>> # randrange gives a random integer in [0, n)
>>>>>
>>>>> filename = argv[1]
>>>>> k = int(argv[2])
>>>>> assert k > 0
>>>>>
>>>>> chosen_lines = []
>>>>> for i, line in enumerate(file(filename)):
>>>>>     if i < k:
>>>>>         chosen_lines.append(line)
>>>>>     else:
>>>>>         if random() < (1.0 / (i+1)):
>>>>>             chosen_lines[randrange(k)] = line
>>>>>
>>>>> print chosen_lines
>>>>
>>>> We have a winner!!! There is actually a very simple proof on how and 
>>>> why this works.
>>>
>>> Say you want 2 lines from a file that has 3 lines. Say the lines are 
>>> a, b and c.
>>>
>>> What's the probability that c belongs to the result? It's "1.0 / 
>>> (i+1)", where i = 2, so 1/3.
>>>
>>> What's the probability that a does not belong to the result? Well, c 
>>> must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 
>>> 0. So it's 1/3 * 1/2 = 1/6.
>>>
>>> What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6.
>>>
>>> What am I doing wrong? :-(
>>
>> Nothing except you stop the induction at step 3...
> 
> ... which is the last step in this case. There are only three lines.
> 
> p(a) = 5/6
> p(b) = 5/6
> p(c) = 1/3
> 
> That doesn't seem uniform.
> 
> In another post, Kirk says: "Of the remaining 2 out of 3 chances, there 
> is a 50% chance the second line will be chosen, and a 50% chance of the 
> first line". Why "of the remaining"? It's in that 1 out of 3 chance, or 
> not?
> 

To reiterate the code I was talking about (the solution for k == 1):

from random import randint

chosen_line = None

for i, line in enumerate(open('filename')):
     if randint(0, i) == 0:
         chosen_line = line

print chosen_line

Assume there are three lines in the file. The algorithm churns through 
to the last line. We have the following state:

i == 2
chosen_line = <one of the first two lines>

Then we call randint(0, 2). This gives us a 1 out of 3 chance of 
chosen_line being the last line (if the random number is 0). Therefore, 
there is also a 2 out of 3 chance of it being some *other* line (if the 
random number is 1 or 2).

It turns out there is an equal chance for it being either of the two 
lines. The algorithm starts on the first line, and does randint(0, 0), 
meaning chosen_line is always set to the first line initially. On the 
second line, it does randint(0, 1), which means it has a 50% chance of 
changing chosen_line to the second line.

The math continues to work out in this way as the number of lines 
increases without bound.

-- 
Kirk McDonald



More information about the Digitalmars-d mailing list