random k-sample of a file

Ary Borenszweig ary at esperanto.org.ar
Thu Oct 9 14:19:30 PDT 2008


Andrei Alexandrescu escribió:
> bearophile wrote:
>> Third solution, this requires a storage of k lines (but you can keep 
>> this storage on disk):
>>
>> from sys import argv
>> from random import random, randrange
>> # randrange gives a random integer in [0, n)
>>
>> filename = argv[1]
>> k = int(argv[2])
>> assert k > 0
>>
>> chosen_lines = []
>> for i, line in enumerate(file(filename)):
>>     if i < k:
>>         chosen_lines.append(line)
>>     else:
>>         if random() < (1.0 / (i+1)):
>>             chosen_lines[randrange(k)] = line
>>
>> print chosen_lines
> 
> We have a winner!!! There is actually a very simple proof on how and why 
> this works.

Say you want 2 lines from a file that has 3 lines. Say the lines are a, 
b and c.

What's the probability that c belongs to the result? It's "1.0 / (i+1)", 
where i = 2, so 1/3.

What's the probability that a does not belong to the result? Well, c 
must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0. 
So it's 1/3 * 1/2 = 1/6.

What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6.

What am I doing wrong? :-(

> 
> Andrei
> 



More information about the Digitalmars-d mailing list