random k-sample of a file

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Thu Oct 9 15:35:00 PDT 2008


Kirk McDonald wrote:
> Ary Borenszweig wrote:
>> Andrei Alexandrescu escribió:
>>> Ary Borenszweig wrote:
>>>> Andrei Alexandrescu escribió:
>>>>> bearophile wrote:
>>>>>> Third solution, this requires a storage of k lines (but you can 
>>>>>> keep this storage on disk):
>>>>>>
>>>>>> from sys import argv
>>>>>> from random import random, randrange
>>>>>> # randrange gives a random integer in [0, n)
>>>>>>
>>>>>> filename = argv[1]
>>>>>> k = int(argv[2])
>>>>>> assert k > 0
>>>>>>
>>>>>> chosen_lines = []
>>>>>> for i, line in enumerate(file(filename)):
>>>>>>     if i < k:
>>>>>>         chosen_lines.append(line)
>>>>>>     else:
>>>>>>         if random() < (1.0 / (i+1)):
>>>>>>             chosen_lines[randrange(k)] = line
>>>>>>
>>>>>> print chosen_lines
>>>>>
>>>>> We have a winner!!! There is actually a very simple proof on how 
>>>>> and why this works.
>>>>
>>>> Say you want 2 lines from a file that has 3 lines. Say the lines are 
>>>> a, b and c.
>>>>
>>>> What's the probability that c belongs to the result? It's "1.0 / 
>>>> (i+1)", where i = 2, so 1/3.
>>>>
>>>> What's the probability that a does not belong to the result? Well, c 
>>>> must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 
>>>> 0. So it's 1/3 * 1/2 = 1/6.
>>>>
>>>> What's the probability that a belongs to the result? It's 1 - 1/6 = 
>>>> 5/6.
>>>>
>>>> What am I doing wrong? :-(
>>>
>>> Nothing except you stop the induction at step 3...
>>
>> ... which is the last step in this case. There are only three lines.
>>
>> p(a) = 5/6
>> p(b) = 5/6
>> p(c) = 1/3
>>
>> That doesn't seem uniform.
>>
>> In another post, Kirk says: "Of the remaining 2 out of 3 chances, 
>> there is a 50% chance the second line will be chosen, and a 50% chance 
>> of the first line". Why "of the remaining"? It's in that 1 out of 3 
>> chance, or not?
>>
> 
> To reiterate the code I was talking about (the solution for k == 1):
> 
> from random import randint
> 
> chosen_line = None
> 
> for i, line in enumerate(open('filename')):
>     if randint(0, i) == 0:
>         chosen_line = line
> 
> print chosen_line
> 
> Assume there are three lines in the file. The algorithm churns through 
> to the last line. We have the following state:
> 
> i == 2
> chosen_line = <one of the first two lines>
> 
> Then we call randint(0, 2). This gives us a 1 out of 3 chance of 
> chosen_line being the last line (if the random number is 0). Therefore, 
> there is also a 2 out of 3 chance of it being some *other* line (if the 
> random number is 1 or 2).
> 
> It turns out there is an equal chance for it being either of the two 
> lines. The algorithm starts on the first line, and does randint(0, 0), 
> meaning chosen_line is always set to the first line initially. On the 
> second line, it does randint(0, 1), which means it has a 50% chance of 
> changing chosen_line to the second line.
> 
> The math continues to work out in this way as the number of lines 
> increases without bound.

That's why I thought it's such a beautiful problem. My proof is 
inductive over n. The invariant is that the k lines held represent a 
uniform selection of the n lines seen so far at any moment.

Base: For n <= k, vacuously satisfied.

Inductive step: We have a random selection of n lines so far in array 
selection, and we add one more line. We need to prove that the new 
selection is also uniform. The new line has probability k / (n + 1) of 
being part of the selection. We toss a skewed coin with that 
probability. If it turns heads, we pick the new element for keeping. Any 
element in the array should go with equal probability (by the inductive 
assumption), so we roll a k-sided dice to choose one element to throw away.


Andrei



More information about the Digitalmars-d mailing list