random k-sample of a file

Thu Oct 9 12:44:43 PDT 2008

bearophile wrote:
> I am not reading the answers written by others, of course :-) With the help of "Programming Pearls", here is my second version. It spares the memory required for the chosen ones, so this code runs with very little memory:
> 
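> from sys import argv, stdout
> from random import random
> 
> filename = argv[1]
> k = int(argv[2])
> # First pass: count the lines.
> nlines = sum(1 for _ in open(filename))
> 
> if k >= nlines:
>     # The file has at most k lines: output everything.
>     for line in open(filename):
>         stdout.write(line)
> else:
>     select = k          # lines still to be chosen
>     remaining = nlines  # lines not yet examined
> 
>     # Second pass: take each line with probability select / remaining.
>     for line in open(filename):
>         if random() < float(select) / remaining:
>             stdout.write(line)  # line already ends with its newline
>             select -= 1
>         remaining -= 1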
> 
> I'll think about a solution that avoids reading the file twice then...
> 
> Bye,
> bearophile

This is unfair. The first line in the original file is included with
probability k / n. The probability of including the second line
depends on whether the first line was included. If it wasn't
included, it's k / (n - 1), otherwise it's (k - 1) / (n - 1). All lines
should be given an equal chance.
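
The two cases above can be weighted by how likely each one is, which gives the unconditional chance for the second line: (k / n) * (k - 1) / (n - 1) + (1 - k / n) * k / (n - 1) = k / n. So the dependence between lines does not by itself settle the fairness question. A minimal sketch for checking every line exactly, not only the second: it follows the quoted select / remaining rule with exact fractions instead of random draws. The function name inclusion_probabilities is made up for this illustration, and it assumes k < n, as in the else branch of the quoted code.

```python
from fractions import Fraction


def inclusion_probabilities(n, k):
    """Exact chance that each of n lines is output by the quoted loop (k < n)."""
    # state maps "lines output so far" to the probability of being in that state.
    state = {0: Fraction(1)}
    result = []
    for i in range(n):
        remaining = n - i
        p_line = Fraction(0)
        next_state = {}
        for chosen, p in state.items():
            select = k - chosen
            # Same ratio the quoted code compares against random().
            take = Fraction(select, remaining)
            p_line += p * take
            if take > 0:
                next_state[chosen + 1] = next_state.get(chosen + 1, 0) + p * take
            if take < 1:
                next_state[chosen] = next_state.get(chosen, 0) + p * (1 - take)
        state = next_state
        result.append(p_line)
    return result


if __name__ == "__main__":
    # Per-line chances for a 5-line file with k = 2, to compare against k / n.
    for position, p in enumerate(inclusion_probabilities(5, 2), start=1):
        print(position, p)
```

Comparing each printed value with k / n shows whether the lines are treated equally. Because the arithmetic is exact, no repeated sampling runs are needed.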

Andrei