random k-sample of a file
Sergey Gromov
snake.scaly at gmail.com
Fri Oct 10 08:25:36 PDT 2008
Fri, 10 Oct 2008 10:55:51 -0400,
bearophile wrote:
> Andrei Alexandrescu:
> > > Ah, ok. It works if you do that. So bearophile's code must say
> > >
> > > if random() > (1.0 / (i+1)):
> > >
> > > instead of
> > >
> > > if random() < (1.0 / (i+1)):
> > >
> > > (that is, if random() returns a real number between 0 and 1 with uniform
> > > distribution)
> >
> > His prize gets retired. Now how does Kirk's code fare? :o)
>
> I don't like contests...
>
> This simple benchmark shows that my third version is wrong in both ways, and it keeps being wrong even if you remove the chosen.append(line), etc.
>
> The problem isn't easy, this paper shows that it's not easy to create an efficient implementation even when the total line count is known:
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.4335
>
> from random import random, randrange
> from collections import defaultdict
>
> def main(k, values, freqs, nrepeats):
> for i in xrange(nrepeats):
> chosen = [None] * k
> for i, line in enumerate(values):
> if i < k:
> chosen.append(line)
> else:
> if random() < (1.0 / (i+1)):
> chosen[randrange(k)] = line
>
> for c in chosen:
> freqs[c] += 1
>
> return freqs
>
> import psyco; psyco.full()
> k = 15
> n = 100
> values = range(n)
> freqs = main(k, values, freqs=defaultdict(int), nrepeats=10000)
> print [freqs.get(i, 0) for i in xrange(n)]
>
>
> My first version works quite probably, and probably the second one too.
>
> I think there are ways to solve this problem keeping only k lines, but the code isn't obvious.
I think your algorithm has one mistake. The last line of your text gets
into the k with probability 1/n while it should be k/n. Replace 1.0
with 1.0*k and the distribution will be absolutely (theoretically)
uniform.
More information about the Digitalmars-d
mailing list