random k-sample of a file
bearophile
bearophileHUGS at lycos.com
Fri Oct 10 07:55:51 PDT 2008
Andrei Alexandrescu:
> > Ah, ok. It works if you do that. So bearophile's code must say
> >
> > if random() > (1.0 / (i+1)):
> >
> > instead of
> >
> > if random() < (1.0 / (i+1)):
> >
> > (that is, if random() returns a real number between 0 and 1 with uniform
> > distribution)
>
> His prize gets retired. Now how does Kirk's code fare? :o)
I don't like contests...
This simple benchmark shows that my third version is wrong in both ways, and it keeps being wrong even if you remove the chosen.append(line), etc.
The problem isn't easy, this paper shows that it's not easy to create an efficient implementation even when the total line count is known:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.4335
from random import random, randrange
from collections import defaultdict
def main(k, values, freqs, nrepeats):
for i in xrange(nrepeats):
chosen = [None] * k
for i, line in enumerate(values):
if i < k:
chosen.append(line)
else:
if random() < (1.0 / (i+1)):
chosen[randrange(k)] = line
for c in chosen:
freqs[c] += 1
return freqs
import psyco; psyco.full()
k = 15
n = 100
values = range(n)
freqs = main(k, values, freqs=defaultdict(int), nrepeats=10000)
print [freqs.get(i, 0) for i in xrange(n)]
My first version works quite probably, and probably the second one too.
I think there are ways to solve this problem keeping only k lines, but the code isn't obvious.
Bye,
bearophile
More information about the Digitalmars-d
mailing list