random k-sample of a file
Ary Borenszweig
ary at esperanto.org.ar
Thu Oct 9 14:19:30 PDT 2008
Andrei Alexandrescu escribió:
> bearophile wrote:
>> Third solution, this requires a storage of k lines (but you can keep
>> this storage on disk):
>>
>> from sys import argv
>> from random import random, randrange
>> # randrange gives a random integer in [0, n)
>>
>> filename = argv[1]
>> k = int(argv[2])
>> assert k > 0
>>
>> chosen_lines = []
>> for i, line in enumerate(file(filename)):
>> if i < k:
>> chosen_lines.append(line)
>> else:
>> if random() < (1.0 / (i+1)):
>> chosen_lines[randrange(k)] = line
>>
>> print chosen_lines
>
> We have a winner!!! There is actually a very simple proof on how and why
> this works.
Say you want 2 lines from a file that has 3 lines. Say the lines are a,
b and c.
What's the probability that c belongs to the result? It's "1.0 / (i+1)",
where i = 2, so 1/3.
What's the probability that a does not belong to the result? Well, c
must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must choose 0.
So it's 1/3 * 1/2 = 1/6.
What's the probability that a belongs to the result? It's 1 - 1/6 = 5/6.
What am I doing wrong? :-(
>
> Andrei
>
More information about the Digitalmars-d
mailing list