random k-sample of a file

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Fri Oct 10 06:49:41 PDT 2008


Ary Borenszweig wrote:
> Andrei Alexandrescu wrote:
>> Ary Borenszweig wrote:
>>> Andrei Alexandrescu escribió:
>>>> Ary Borenszweig wrote:
>>>>> Andrei Alexandrescu escribió:
>>>>>> bearophile wrote:
>>>>>>> Third solution, this requires a storage of k lines (but you can 
>>>>>>> keep this storage on disk):
>>>>>>>
>>>>>>> from sys import argv
>>>>>>> from random import random, randrange
>>>>>>> # randrange gives a random integer in [0, n)
>>>>>>>
>>>>>>> filename = argv[1]
>>>>>>> k = int(argv[2])
>>>>>>> assert k > 0
>>>>>>>
>>>>>>> chosen_lines = []
>>>>>>> for i, line in enumerate(file(filename)):
>>>>>>>     if i < k:
>>>>>>>         chosen_lines.append(line)
>>>>>>>     else:
>>>>>>>         if random() < (1.0 / (i+1)):
>>>>>>>             chosen_lines[randrange(k)] = line
>>>>>>>
>>>>>>> print chosen_lines
>>>>>>
>>>>>> We have a winner!!! There is actually a very simple proof on how 
>>>>>> and why this works.
>>>>>
>>>>> Say you want 2 lines from a file that has 3 lines. Say the lines 
>>>>> are a, b and c.
>>>>>
>>>>> What's the probability that c belongs to the result? It's "1.0 / 
>>>>> (i+1)", where i = 2, so 1/3.
>>>>>
>>>>> What's the probability that a does not belong to the result? Well, 
>>>>> c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must 
>>>>> choose 0. So it's 1/3 * 1/2 = 1/6.
>>>>>
>>>>> What's the probability that a belongs to the result? It's 1 - 1/6 = 
>>>>> 5/6.
>>>>>
>>>>> What am I doing wrong? :-(
>>>>
>>>> Nothing except you stop the induction at step 3...
>>>
>>> ... which is the last step in this case. There are only three lines.
>>>
>>> p(a) = 5/6
>>> p(b) = 5/6
>>> p(c) = 1/3
>>>
>>> That doesn't seem uniform.
>>>
>>> In another post, Kirk says: "Of the remaining 2 out of 3 chances, 
>>> there is a 50% chance the second line will be chosen, and a 50% 
>>> chance of the first line". Why "of the remaining"? It's in that 1 out 
>>> of 3 chance, or not?
>>
>> Oh, sorry. You need to select c with probability 2.0 / 3.0, not 1.0 / 
>> 3.0. This is because c has the "right" to sit equally in either of the 
>> k positions. If code doesn't do that, there's a bug in it.
>>
>> Then probability of a going to Hades is (2.0/3.0) * (1.0/2/0) = 
>> 1.0/3.0, as it should.
> 
> Ah, ok. It works if you do that. So bearophile's code must say
> 
> if random() > (1.0 / (i+1)):
> 
> instead of
> 
> if random() < (1.0 / (i+1)):
> 
> (that is, if random() returns a real number between 0 and 1 with uniform 
> distribution)

His prize gets retired. Now how does Kirk's code fare? :o)

Andrei



More information about the Digitalmars-d mailing list