random k-sample of a file
Ary Borenszweig
ary at esperanto.org.ar
Fri Oct 10 07:17:44 PDT 2008
Andrei Alexandrescu wrote:
> Ary Borenszweig wrote:
>> Andrei Alexandrescu wrote:
>>> Ary Borenszweig wrote:
>>>> Andrei Alexandrescu escribió:
>>>>> Ary Borenszweig wrote:
>>>>>> Andrei Alexandrescu escribió:
>>>>>>> bearophile wrote:
>>>>>>>> Third solution, this requires a storage of k lines (but you can
>>>>>>>> keep this storage on disk):
>>>>>>>>
>>>>>>>> from sys import argv
>>>>>>>> from random import random, randrange
>>>>>>>> # randrange gives a random integer in [0, n)
>>>>>>>>
>>>>>>>> filename = argv[1]
>>>>>>>> k = int(argv[2])
>>>>>>>> assert k > 0
>>>>>>>>
>>>>>>>> chosen_lines = []
>>>>>>>> for i, line in enumerate(file(filename)):
>>>>>>>> if i < k:
>>>>>>>> chosen_lines.append(line)
>>>>>>>> else:
>>>>>>>> if random() < (1.0 / (i+1)):
>>>>>>>> chosen_lines[randrange(k)] = line
>>>>>>>>
>>>>>>>> print chosen_lines
>>>>>>>
>>>>>>> We have a winner!!! There is actually a very simple proof on how
>>>>>>> and why this works.
>>>>>>
>>>>>> Say you want 2 lines from a file that has 3 lines. Say the lines
>>>>>> are a, b and c.
>>>>>>
>>>>>> What's the probability that c belongs to the result? It's "1.0 /
>>>>>> (i+1)", where i = 2, so 1/3.
>>>>>>
>>>>>> What's the probability that a does not belong to the result? Well,
>>>>>> c must be chosen (thats "1.0 / (i+1)"), and "randrange(k)" must
>>>>>> choose 0. So it's 1/3 * 1/2 = 1/6.
>>>>>>
>>>>>> What's the probability that a belongs to the result? It's 1 - 1/6
>>>>>> = 5/6.
>>>>>>
>>>>>> What am I doing wrong? :-(
>>>>>
>>>>> Nothing except you stop the induction at step 3...
>>>>
>>>> ... which is the last step in this case. There are only three lines.
>>>>
>>>> p(a) = 5/6
>>>> p(b) = 5/6
>>>> p(c) = 1/3
>>>>
>>>> That doesn't seem uniform.
>>>>
>>>> In another post, Kirk says: "Of the remaining 2 out of 3 chances,
>>>> there is a 50% chance the second line will be chosen, and a 50%
>>>> chance of the first line". Why "of the remaining"? It's in that 1
>>>> out of 3 chance, or not?
>>>
>>> Oh, sorry. You need to select c with probability 2.0 / 3.0, not 1.0 /
>>> 3.0. This is because c has the "right" to sit equally in either of
>>> the k positions. If code doesn't do that, there's a bug in it.
>>>
>>> Then probability of a going to Hades is (2.0/3.0) * (1.0/2/0) =
>>> 1.0/3.0, as it should.
>>
>> Ah, ok. It works if you do that. So bearophile's code must say
>>
>> if random() > (1.0 / (i+1)):
>>
>> instead of
>>
>> if random() < (1.0 / (i+1)):
>>
>> (that is, if random() returns a real number between 0 and 1 with
>> uniform distribution)
>
> His prize gets retired. Now how does Kirk's code fare? :o)
I don't have Kirk's code here at work (I told Thunderbird not to
download everything I had in my home)... but if I remember correctly, it
had the same problem, it was something like:
if (0 == uniform(0, k)):
or something like that, and should be 0 != ...
I wonder why both used the wrong probability...
(that's why I made the math, because I wasn't sure with that probability
it worked)
I really liked the problem! It has a very simple and elegant solution...
I thought the probability would be something more complex, but it turned
out to be simple. :-)
More information about the Digitalmars-d
mailing list