random k-sample of a file

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Thu Oct 9 14:59:15 PDT 2008


Ary Borenszweig wrote:
> Andrei Alexandrescu escribió:
>> bearophile wrote:
>>> Third solution, this requires a storage of k lines (but you can keep 
>>> this storage on disk):
>>>
>>> from sys import argv
>>> from random import random, randrange
>>> # randrange gives a random integer in [0, n)
>>>
>>> filename = argv[1]
>>> k = int(argv[2])
>>> assert k > 0
>>>
>>> chosen_lines = []
>>> for i, line in enumerate(file(filename)):
>>>     if i < k:
>>>         chosen_lines.append(line)
>>>     else:
>>>         if random() < (1.0 / (i+1)):
>>>             chosen_lines[randrange(k)] = line
>>>
>>> print chosen_lines
>>
>> We have a winner!!! There is actually a very simple proof on how and 
>> why this works.
> 
> Say you want 2 lines from a file that has 3 lines. Say the lines are a, 
> b and c.
> 
> What's the probability that c belongs to the result? It's "1.0 / (i+1)", 
> where i = 2, so 1/3.
> 
> What's the probability that a does not belong to the result? Well, c 
> must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. 
> So it's 1/3 * 1/2 = 1/6.
> 
> What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6.
> 
> What am I doing wrong? :-(

Nothing except you stop the induction at step 3...

Andrei



More information about the Digitalmars-d mailing list