random k-sample of a file
Sergey Gromov
snake.scaly at gmail.com
Fri Oct 10 05:11:59 PDT 2008
Thu, 09 Oct 2008 22:28:51 -0500,
Andrei Alexandrescu wrote:
> Sergey Gromov wrote:
> > Thu, 09 Oct 2008 13:56:48 -0500,
> > Andrei Alexandrescu wrote:
> >> I just ran across a nice little problem that I wanted to share as an
> >> exercise for the interested in futile pastimes.
> >>
> >> The challenge, should you accept it, is to write a program that, given
> >> a number k and a file, outputs k lines picked uniformly at random from
> >> the file. (If the file has fewer than k lines, they should all be
> >> output.) The file's size is unbounded, so loading it into memory is not
> >> an option. How would you go about it?
> >
> > Here's mine
> >
> > module k_sample;
> >
> > import std.conv: to;
> > import std.stream: BufferedFile;
> > import std.random: Random, uniform, unpredictableSeed;
> > import std.stdio: writefln;
> >
> > void main(string[] args)
> > {
> >     auto f = new BufferedFile(args[1]);
> >     auto k = to!(int)(args[2]);
> >     assert(k > 0);
> >
> >     struct Sample
> >     {
> >         string text;
> >         ulong line;
> >         this(string s, ulong n)
> >         {
> >             text = s;
> >             line = n;
> >         }
> >     }
> >
> >     Sample[] samples;
> >     auto gen = Random(unpredictableSeed);
> >
> >     foreach (ulong n, char[] s; f)
> >     {
> >         if (samples.length < k)
> >             samples ~= Sample(s.idup, n+1);
> >         else if (uniform(gen, 0., 1.) < cast(double) k / (n+1))
> >             samples[uniform(gen, 0, k)] = Sample(s.idup, n+1);
> >     }
> >
> >     foreach (sample; samples)
> >         writefln("%8s: %s", sample.line, sample.text);
> > }
> >
> > Usage: k-sample <file> <k>
> >
> > It's easy to prove that the probability of any particular line appearing
> > in the sample is k/n, where n is the total number of lines in the text.
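
(For anyone who wants the argument spelled out: suppose that after m >= k
lines each of them sits in the reservoir with probability k/m. Line m+1 is
inserted with probability k/(m+1) and evicts a uniformly chosen resident,
so each resident survives with probability 1 - (k/(m+1))*(1/k) = m/(m+1),
and (k/m)*(m/(m+1)) = k/(m+1). The first k lines start at probability
k/k = 1, so by induction every line ends up in the sample with probability
k/n.)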
>
> Sweet! Suggestion: replace Sample's definition with:
>
> alias Tuple!(string, "text", ulong, "line") Sample;
Good advice, thanks. The Sample definition is unimportant for the
solution but eats up a lot of visual space.
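
For the record, roughly how I'd expect that to look (a sketch; assumes
std.typecons is imported, everything else stays the same):

    import std.typecons : Tuple;

    // Named-field tuple replacing the hand-written struct; construction
    // and member access keep the same syntax as before:
    //   samples ~= Sample(s.idup, n+1);
    //   writefln("%8s: %s", sample.line, sample.text);
    alias Tuple!(string, "text", ulong, "line") Sample;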
> On a different topic, your use of std.stream reminded me. I was thinking
> of yanking it from Phobos. Thoughts?
While experimenting with Bearophile's loader,
http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D.announce&article_id=13402
I found that BufferedFile is the fastest available way to process a file
line by line, so any replacement should be at least as fast.
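
If std.stream does go, whatever replaces it would need to make a loop like
this one just as fast (a sketch using std.stdio's File and byLine as the
likely candidate; I haven't benchmarked it):

    import std.stdio : File;

    void eachLine(string path)
    {
        auto f = File(path);        // buffered reads under the hood
        ulong n = 0;
        foreach (line; f.byLine())  // char[] view into an internal buffer
        {
            ++n;
            // process the line here; .idup it if it must outlive
            // the current iteration
        }
    }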