random k-sample of a file
Andrei Alexandrescu
SeeWebsiteForEmail at erdani.org
Thu Oct 9 20:28:51 PDT 2008
Sergey Gromov wrote:
> Thu, 09 Oct 2008 13:56:48 -0500,
> Andrei Alexandrescu wrote:
>> I just ran across a nice little problem that I wanted to share as an
>> exercise for the interested in futile pastimes.
>>
>> The challenge, should you accept it, is to write a program that given a
>> number k and a file, outputs k lines picked uniformly at random from the
>> file. (If the file has less than k lines, it should all be output.) The
>> file's size is unbounded so loading it in memory is not an option. How'd
>> you go about it?
>
> Here's mine
>
> module k_sample;
>
> import std.conv: to;
> import std.stream: BufferedFile;
> import std.random: Random, uniform, unpredictableSeed;
> import std.stdio: writefln;
>
> void main(string[] args)
> {
> auto f = new BufferedFile(args[1]);
> auto k = to!(int)(args[2]);
> assert(k > 0);
>
> struct Sample
> {
> string text;
> ulong line;
> this(string s, ulong n)
> {
> text = s;
> line = n;
> }
> }
>
> Sample[] samples;
> auto gen = Random(unpredictableSeed);
>
> foreach (ulong n, char[] s; f)
> {
> if (samples.length < k)
> samples ~= Sample(s.idup, n+1);
> else if (uniform(gen, 0., 1.) < cast(double) k / (n+1))
> samples[uniform(gen, 0, k)] = Sample(s.idup, n+1);
> }
>
> foreach (sample; samples)
> writefln("%8s: %s", sample.line, sample.text);
> }
>
> Usage: k-sample <file> <k>
>
> It's easy to prove that a probability of any particular line appearing
> in the samples is k/n where n is the total number of lines in the text.
Sweet! Suggestion: replace Sample's definition with:
alias Tuple!(string, "text", ulong, "line") Sample;
On a different topic, your use of std.stream reminded me. I was thinking
of yanking it from Phobos. Thoughts?
Andrei
More information about the Digitalmars-d
mailing list