random k-sample of a file

Ary Borenszweig ary at esperanto.org.ar
Thu Oct 9 15:04:55 PDT 2008


Andrei Alexandrescu escribió:
> Ary Borenszweig wrote:
>> Andrei Alexandrescu escribió:
>>> bearophile wrote:
>>>> Third solution, this requires a storage of k lines (but you can keep 
>>>> this storage on disk):
>>>>
>>>> from sys import argv
>>>> from random import random, randrange
>>>> # randrange gives a random integer in [0, n)
>>>>
>>>> filename = argv[1]
>>>> k = int(argv[2])
>>>> assert k > 0
>>>>
>>>> chosen_lines = []
>>>> for i, line in enumerate(file(filename)):
>>>>     if i < k:
>>>>         chosen_lines.append(line)
>>>>     else:
>>>>         if random() < (1.0 / (i+1)):
>>>>             chosen_lines[randrange(k)] = line
>>>>
>>>> print chosen_lines
>>>
>>> We have a winner!!! There is actually a very simple proof on how and 
>>> why this works.
>>
>> Say you want 2 lines from a file that has 3 lines. Say the lines are 
>> a, b and c.
>>
>> What's the probability that c belongs to the result? It's "1.0 / 
>> (i+1)", where i = 2, so 1/3.
>>
>> What's the probability that a does not belong to the result? Well, c 
>> must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 
>> 0. So it's 1/3 * 1/2 = 1/6.
>>
>> What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6.
>>
>> What am I doing wrong? :-(
> 
> Nothing except you stop the induction at step 3...

... which is the last step in this case. There are only three lines.

p(a) = 5/6
p(b) = 5/6
p(c) = 1/3

That doesn't seem uniform.

In another post, Kirk says: "Of the remaining 2 out of 3 chances, there 
is a 50% chance the second line will be chosen, and a 50% chance of the 
first line". Why "of the remaining"? It's in that 1 out of 3 chance, or not?

> 
> Andrei



More information about the Digitalmars-d mailing list