random k-sample of a file

Sergey Gromov snake.scaly at gmail.com
Thu Oct 9 18:58:29 PDT 2008


Thu, 09 Oct 2008 13:56:48 -0500,
Andrei Alexandrescu wrote:
> I just ran across a nice little problem that I wanted to share as an 
> exercise for the interested in futile pastimes.
> 
> The challenge, should you accept it, is to write a program that given a 
> number k and a file, outputs k lines picked uniformly at random from the 
> file. (If the file has less than k lines, it should all be output.) The 
> file's size is unbounded so loading it in memory is not an option. How'd 
> you go about it?

Here's mine

module k_sample;

import std.conv: to;
import std.stream: BufferedFile;
import std.random: Random, uniform, unpredictableSeed;
import std.stdio: writefln;

void main(string[] args)
{
    auto f = new BufferedFile(args[1]);
    auto k = to!(int)(args[2]);
    assert(k > 0);

    struct Sample
    {
        string text;
        ulong line;
        this(string s, ulong n)
        {
            text = s;
            line = n;
        }
    }

    Sample[] samples;
    auto gen = Random(unpredictableSeed);

    foreach (ulong n, char[] s; f)
    {
        if (samples.length < k)
            samples ~= Sample(s.idup, n+1);
        else if (uniform(gen, 0., 1.) < cast(double) k / (n+1))
            samples[uniform(gen, 0, k)] = Sample(s.idup, n+1);
    }

    foreach (sample; samples)
        writefln("%8s: %s", sample.line, sample.text);
}

Usage: k-sample <file> <k>

It's easy to prove that a probability of any particular line appearing 
in the samples is k/n where n is the total number of lines in the text.



More information about the Digitalmars-d mailing list