random k-sample of a file
Sergey Gromov
snake.scaly at gmail.com
Thu Oct 9 18:58:29 PDT 2008
Thu, 09 Oct 2008 13:56:48 -0500,
Andrei Alexandrescu wrote:
> I just ran across a nice little problem that I wanted to share as an
> exercise for the interested in futile pastimes.
>
> The challenge, should you accept it, is to write a program that given a
> number k and a file, outputs k lines picked uniformly at random from the
> file. (If the file has less than k lines, it should all be output.) The
> file's size is unbounded so loading it in memory is not an option. How'd
> you go about it?
Here's mine
module k_sample;
import std.conv: to;
import std.stream: BufferedFile;
import std.random: Random, uniform, unpredictableSeed;
import std.stdio: writefln;
void main(string[] args)
{
auto f = new BufferedFile(args[1]);
auto k = to!(int)(args[2]);
assert(k > 0);
struct Sample
{
string text;
ulong line;
this(string s, ulong n)
{
text = s;
line = n;
}
}
Sample[] samples;
auto gen = Random(unpredictableSeed);
foreach (ulong n, char[] s; f)
{
if (samples.length < k)
samples ~= Sample(s.idup, n+1);
else if (uniform(gen, 0., 1.) < cast(double) k / (n+1))
samples[uniform(gen, 0, k)] = Sample(s.idup, n+1);
}
foreach (sample; samples)
writefln("%8s: %s", sample.line, sample.text);
}
Usage: k-sample <file> <k>
It's easy to prove that a probability of any particular line appearing
in the samples is k/n where n is the total number of lines in the text.
More information about the Digitalmars-d
mailing list