random k-sample of a file

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Thu Oct 9 20:28:51 PDT 2008


Sergey Gromov wrote:
> Thu, 09 Oct 2008 13:56:48 -0500,
> Andrei Alexandrescu wrote:
>> I just ran across a nice little problem that I wanted to share as an 
>> exercise for the interested in futile pastimes.
>>
>> The challenge, should you accept it, is to write a program that given a 
>> number k and a file, outputs k lines picked uniformly at random from the 
>> file. (If the file has less than k lines, it should all be output.) The 
>> file's size is unbounded so loading it in memory is not an option. How'd 
>> you go about it?
> 
> Here's mine
> 
> module k_sample;
> 
> import std.conv: to;
> import std.stream: BufferedFile;
> import std.random: Random, uniform, unpredictableSeed;
> import std.stdio: writefln;
> 
> void main(string[] args)
> {
>     auto f = new BufferedFile(args[1]);
>     auto k = to!(int)(args[2]);
>     assert(k > 0);
> 
>     struct Sample
>     {
>         string text;
>         ulong line;
>         this(string s, ulong n)
>         {
>             text = s;
>             line = n;
>         }
>     }
> 
>     Sample[] samples;
>     auto gen = Random(unpredictableSeed);
> 
>     foreach (ulong n, char[] s; f)
>     {
>         if (samples.length < k)
>             samples ~= Sample(s.idup, n+1);
>         else if (uniform(gen, 0., 1.) < cast(double) k / (n+1))
>             samples[uniform(gen, 0, k)] = Sample(s.idup, n+1);
>     }
> 
>     foreach (sample; samples)
>         writefln("%8s: %s", sample.line, sample.text);
> }
> 
> Usage: k-sample <file> <k>
> 
> It's easy to prove that a probability of any particular line appearing 
> in the samples is k/n where n is the total number of lines in the text.

Sweet! Suggestion: replace Sample's definition with:

alias Tuple!(string, "text", ulong, "line") Sample;

On a different topic, your use of std.stream reminded me. I was thinking 
of yanking it from Phobos. Thoughts?


Andrei



More information about the Digitalmars-d mailing list