random k-sample of a file
bearophile
bearophileHUGS at lycos.com
Thu Oct 9 13:19:06 PDT 2008
Andrei Alexandrescu:
> This is unfair. First line in the original file is included with
> probability k / n. The probability of including the second line is
> dependent on the event of including the first line. If it wasn't
> included, it's k / (n - 1), otherwise it's (k - 1) / (n - 1). All lines
> should be given equal chance.
If each of them have the same chance, then you can't guarantee the result has k items. So you must have a biased choice.
You can run this code:
from random import random
from collections import defaultdict
def main(freqs, nrepeats):
for i in xrange(nrepeats):
chosen = []
select = k
remaining = n
for value in values:
if random() < (float(select) / remaining):
chosen.append(value)
select -= 1
remaining -= 1
for c in chosen:
freqs[c] += 1
return freqs
import psyco; psyco.full()
k = 15
n = 100
values = range(n)
freqs = main(freqs=defaultdict(int), nrepeats=10000)
print [freqs.get(i, 0) for i in xrange(n)]
Its output:
[1549, 1519, 1494, 1467, 1485, 1482, 1481, 1526, 1473, 1523, 1466, 1451, 1475, 1482, 1504, 1551, 1504, 1514, 1510, 1541, 1525, 1550, 1531, 1488, 1529, 1467, 1460, 1506, 1544, 1535, 1455, 1468, 1507, 1505, 1490, 1519, 1489, 1507, 1439, 1561, 1499, 1497, 1534, 1512, 1529, 1442, 1437, 1545, 1547, 1621, 1477, 1525, 1492, 1497, 1492, 1533, 1504, 1482, 1489, 1463, 1479, 1488, 1498, 1502, 1503, 1496, 1468, 1458, 1538, 1536, 1479, 1496, 1477, 1514, 1464, 1526, 1451, 1507, 1527, 1415, 1520, 1476, 1526, 1508, 1479, 1504, 1493, 1466, 1490, 1497, 1472, 1548, 1512, 1535, 1503, 1533, 1467, 1471, 1463, 1526]
It shows the values are balanced.
If that little benchmark isn't enough, my code is based on algorithm S by Knuth, you can take a look on its Art of computer programming for a demonstration that it work correctly.
Bye,
bearophile
More information about the Digitalmars-d
mailing list