random k-sample of a file

Fri Oct 10 07:55:51 PDT 2008

Andrei Alexandrescu:
> > Ah, ok. It works if you do that. So bearophile's code must say
> > 
> > if random() > (1.0 / (i+1)):
> > 
> > instead of
> > 
> > if random() < (1.0 / (i+1)):
> > 
> > (that is, if random() returns a real number between 0 and 1 with uniform 
> > distribution)
> 
> His prize gets retired. Now how does Kirk's code fare? :o)

I don't like contests...

This simple benchmark shows that my third version is wrong in both ways, and it keeps being wrong even if you remove the chosen.append(line), etc.

The problem isn't easy, this paper shows that it's not easy to create an efficient implementation even when the total line count is known:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.4335

from random import random, randrange
from collections import defaultdict

def main(k, values, freqs, nrepeats):
    for i in xrange(nrepeats):
        chosen = [None] * k
        for i, line in enumerate(values):
            if i < k:
                chosen.append(line)
            else:
                if random() < (1.0 / (i+1)):
                    chosen[randrange(k)] = line

        for c in chosen:
            freqs[c] += 1

    return freqs

import psyco; psyco.full()
k = 15
n = 100
values = range(n)
freqs = main(k, values, freqs=defaultdict(int), nrepeats=10000)
print [freqs.get(i, 0) for i in xrange(n)]

My first version works quite probably, and probably the second one too.

I think there are ways to solve this problem keeping only k lines, but the code isn't obvious.

Bye,
bearophile