random k-sample of a file

Ary Borenszweig ary at esperanto.org.ar
Fri Oct 10 06:48:53 PDT 2008


Andrei Alexandrescu wrote:
> Ary Borenszweig wrote:
>> Andrei Alexandrescu escribió:
>>> Ary Borenszweig wrote:
>>>> Andrei Alexandrescu escribió:
>>>>> bearophile wrote:
>>>>>> Third solution, this requires a storage of k lines (but you can 
>>>>>> keep this storage on disk):
>>>>>>
>>>>>> from sys import argv
>>>>>> from random import random, randrange
>>>>>> # randrange gives a random integer in [0, n)
>>>>>>
>>>>>> filename = argv[1]
>>>>>> k = int(argv[2])
>>>>>> assert k > 0
>>>>>>
>>>>>> chosen_lines = []
>>>>>> for i, line in enumerate(file(filename)):
>>>>>>     if i < k:
>>>>>>         chosen_lines.append(line)
>>>>>>     else:
>>>>>>         if random() < (1.0 / (i+1)):
>>>>>>             chosen_lines[randrange(k)] = line
>>>>>>
>>>>>> print chosen_lines
>>>>>
>>>>> We have a winner!!! There is actually a very simple proof on how 
>>>>> and why this works.
>>>>
>>>> Say you want 2 lines from a file that has 3 lines. Say the lines are 
>>>> a, b and c.
>>>>
>>>> What's the probability that c belongs to the result? It's "1.0 / 
>>>> (i+1)", where i = 2, so 1/3.
>>>>
>>>> What's the probability that a does not belong to the result? Well, c 
>>>> must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 
>>>> 0. So it's 1/3 * 1/2 = 1/6.
>>>>
>>>> What's the probability that a belongs to the result? It's 1 - 1/6 = 
>>>> 5/6.
>>>>
>>>> What am I doing wrong? :-(
>>>
>>> Nothing except you stop the induction at step 3...
>>
>> ... which is the last step in this case. There are only three lines.
>>
>> p(a) = 5/6
>> p(b) = 5/6
>> p(c) = 1/3
>>
>> That doesn't seem uniform.
>>
>> In another post, Kirk says: "Of the remaining 2 out of 3 chances, 
>> there is a 50% chance the second line will be chosen, and a 50% chance 
>> of the first line". Why "of the remaining"? It's in that 1 out of 3 
>> chance, or not?
> 
> Oh, sorry. You need to select c with probability 2.0 / 3.0, not 1.0 / 
> 3.0. This is because c has the "right" to sit equally in either of the k 
> positions. If code doesn't do that, there's a bug in it.
> 
> Then probability of a going to Hades is (2.0/3.0) * (1.0/2/0) = 1.0/3.0, 
> as it should.

Ah, ok. It works if you do that. So bearophile's code must say

if random() > (1.0 / (i+1)):

instead of

if random() < (1.0 / (i+1)):

(that is, if random() returns a real number between 0 and 1 with uniform 
distribution)

> 
> 
> Andrei



More information about the Digitalmars-d mailing list