random k-sample of a file

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Thu Oct 9 13:30:41 PDT 2008


bearophile wrote:
> Third solution, this requires a storage of k lines (but you can keep this storage on disk):
> 
> from sys import argv
> from random import random, randrange
> # randrange gives a random integer in [0, n)
> 
> filename = argv[1]
> k = int(argv[2])
> assert k > 0
> 
> chosen_lines = []
> for i, line in enumerate(file(filename)):
>     if i < k:
>         chosen_lines.append(line)
>     else:
>         # keep the new line with probability k/(i+1), not 1/(i+1)
>         if random() < (float(k) / (i+1)):
>             chosen_lines[randrange(k)] = line
> 
> print chosen_lines

We have a winner!!! There is actually a very simple proof of how and why 
this works.
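
One way to sketch such a proof (not necessarily the one meant above) is by induction on the number n of lines read so far, assuming the replacement test uses probability k/n, the usual reservoir-sampling choice. Suppose that after n-1 lines each of them sits in the reservoir with probability k/(n-1). Line n enters with probability k/n. An older line is evicted only if line n enters and lands on its slot, which happens with probability (k/n)(1/k) = 1/n, so it survives with probability (n-1)/n and stays in the reservoir with probability k/(n-1) * (n-1)/n = k/n. The snippet below is a rough Python 3 check of that claim; the names reservoir_sample and inclusion_frequencies are made up for illustration and are not from the thread.

```python
import random
from collections import Counter


def reservoir_sample(items, k, rng=random):
    """Pick up to k items uniformly at random from an iterable of unknown length."""
    chosen = []
    for i, item in enumerate(items):
        if i < k:
            # The first k items fill the reservoir unconditionally.
            chosen.append(item)
        elif rng.random() < k / (i + 1):
            # Item number i+1 gets in with probability k/(i+1) and
            # overwrites a uniformly chosen slot.
            chosen[rng.randrange(k)] = item
    return chosen


def inclusion_frequencies(n, k, trials, seed=0):
    """Estimate how often each of n items ends up in the sample."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    counts = Counter()
    for _ in range(trials):
        counts.update(reservoir_sample(range(n), k, rng))
    return [counts[i] / trials for i in range(n)]


if __name__ == "__main__":
    # With n=10 and k=3, every frequency should come out close to 0.3.
    print(inclusion_frequencies(10, 3, 20000))
```

If the test were 1/(i+1) instead of k/(i+1), the later items would show up noticeably less often than the first k, which this kind of frequency count makes easy to spot.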

Andrei



