random k-sample of a file

Fri Oct 10 08:25:36 PDT 2008

Fri, 10 Oct 2008 10:55:51 -0400,
bearophile wrote:
> Andrei Alexandrescu:
> > > Ah, ok. It works if you do that. So bearophile's code must say
> > > 
> > > if random() > (1.0 / (i+1)):
> > > 
> > > instead of
> > > 
> > > if random() < (1.0 / (i+1)):
> > > 
> > > (that is, if random() returns a real number between 0 and 1 with uniform 
> > > distribution)
> > 
> > His prize gets retired. Now how does Kirk's code fare? :o)
> 
> I don't like contests...
> 
> This simple benchmark shows that my third version is wrong in both ways, and it keeps being wrong even if you remove the chosen.append(line), etc.
> 
> The problem isn't easy, this paper shows that it's not easy to create an efficient implementation even when the total line count is known:
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.4335
> 
> from random import random, randrange
> from collections import defaultdict
> 
> def main(k, values, freqs, nrepeats):
>     for i in xrange(nrepeats):
>         chosen = [None] * k
>         for i, line in enumerate(values):
>             if i < k:
>                 chosen.append(line)
>             else:
>                 if random() < (1.0 / (i+1)):
>                     chosen[randrange(k)] = line
> 
>         for c in chosen:
>             freqs[c] += 1
> 
>     return freqs
> 
> import psyco; psyco.full()
> k = 15
> n = 100
> values = range(n)
> freqs = main(k, values, freqs=defaultdict(int), nrepeats=10000)
> print [freqs.get(i, 0) for i in xrange(n)]
> 
> 
> My first version works quite probably, and probably the second one too.
> 
> I think there are ways to solve this problem keeping only k lines, but the code isn't obvious.

I think your algorithm has one mistake.  The last line of your text gets 
into the k with probability 1/n while it should be k/n.  Replace 1.0 
with 1.0*k and the distribution will be absolutely (theoretically) 
uniform.