random k-sample of a file

bearophile bearophileHUGS at lycos.com
Thu Oct 9 12:19:44 PDT 2008


Andrei Alexandrescu:
> I just ran across a nice little problem that I wanted to share as an 
> exercise for the interested in futile pastimes.
> The challenge, should you accept it, is to write a program that given a 
> number k and a file, outputs k lines picked uniformly at random from the 
> file. (If the file has less than k lines, it should all be output.) The 
> file's size is unbounded so loading it in memory is not an option. How'd 
> you go about it?

It's an interesting problem, I can offer two solutions, the first one is the one I use in practice, that is use a scripting language (but the same exact program can be written in D, using my libs it becomes very similar) that loads the file two times, the first to count the lines, then creates an array of randomly chosen indexes (without replacement!) and uses them to print the chosen ones: 

from sys import argv
from random import sample

filename = argv[1]
k = int(argv[2])
nlines = sum(1 for _ in file(filename))

if k >= nlines:
    for line in file(filename):
        print line
else:
    samples = set(sample(xrange(nlines), k))
    for i, line in enumerate(file(filename)):
        if i in samples:
            print line

If your practical tests show you that reading your file twice is too much slow, then you may want to look for a more complex solution.

I'll think about the more complex problem in the next post.

Later,
bearophile



More information about the Digitalmars-d mailing list