random k-sample of a file

Thu Oct 9 13:19:06 PDT 2008

Andrei Alexandrescu:
> This is unfair. First line in the original file is included with 
> probability k / n. The probability of including the second line is 
> dependent on the event of including the first line. If it wasn't 
> included, it's k / (n - 1), otherwise it's (k - 1) / (n - 1). All lines 
> should be given equal chance.

If each of them have the same chance, then you can't guarantee the result has k items. So you must have a biased choice.

You can run this code:

from random import random
from collections import defaultdict

def main(freqs, nrepeats):
    for i in xrange(nrepeats):
        chosen = []
        select = k
        remaining = n

        for value in values:
            if random() < (float(select) / remaining):
                chosen.append(value)
                select -= 1
            remaining -= 1

        for c in chosen:
            freqs[c] += 1

    return freqs

import psyco; psyco.full()
k = 15
n = 100
values = range(n)
freqs = main(freqs=defaultdict(int), nrepeats=10000)
print [freqs.get(i, 0) for i in xrange(n)]

Its output:

[1549, 1519, 1494, 1467, 1485, 1482, 1481, 1526, 1473, 1523, 1466, 1451, 1475, 1482, 1504, 1551, 1504, 1514, 1510, 1541, 1525, 1550, 1531, 1488, 1529, 1467, 1460, 1506, 1544, 1535, 1455, 1468, 1507, 1505, 1490, 1519, 1489, 1507, 1439, 1561, 1499, 1497, 1534, 1512, 1529, 1442, 1437, 1545, 1547, 1621, 1477, 1525, 1492, 1497, 1492, 1533, 1504, 1482, 1489, 1463, 1479, 1488, 1498, 1502, 1503, 1496, 1468, 1458, 1538, 1536, 1479, 1496, 1477, 1514, 1464, 1526, 1451, 1507, 1527, 1415, 1520, 1476, 1526, 1508, 1479, 1504, 1493, 1466, 1490, 1497, 1472, 1548, 1512, 1535, 1503, 1533, 1467, 1471, 1463, 1526]

It shows the values are balanced.

If that little benchmark isn't enough, my code is based on algorithm S by Knuth, you can take a look on its Art of computer programming for a demonstration that it work correctly.

Bye,
bearophile