random k-sample of a file

Sergey Gromov snake.scaly at gmail.com
Fri Oct 10 05:11:59 PDT 2008


Thu, 09 Oct 2008 22:28:51 -0500,
Andrei Alexandrescu wrote:
> Sergey Gromov wrote:
> > Thu, 09 Oct 2008 13:56:48 -0500,
> > Andrei Alexandrescu wrote:
> >> I just ran across a nice little problem that I wanted to share as an 
> >> exercise for the interested in futile pastimes.
> >>
> >> The challenge, should you accept it, is to write a program that given a 
> >> number k and a file, outputs k lines picked uniformly at random from the 
> >> file. (If the file has less than k lines, it should all be output.) The 
> >> file's size is unbounded so loading it in memory is not an option. How'd 
> >> you go about it?
> > 
> > Here's mine
> > 
> > module k_sample;
> > 
> > import std.conv: to;
> > import std.stream: BufferedFile;
> > import std.random: Random, uniform, unpredictableSeed;
> > import std.stdio: writefln;
> > 
> > void main(string[] args)
> > {
> >     auto f = new BufferedFile(args[1]);
> >     auto k = to!(int)(args[2]);
> >     assert(k > 0);
> > 
> >     struct Sample
> >     {
> >         string text;
> >         ulong line;
> >         this(string s, ulong n)
> >         {
> >             text = s;
> >             line = n;
> >         }
> >     }
> > 
> >     Sample[] samples;
> >     auto gen = Random(unpredictableSeed);
> > 
> >     foreach (ulong n, char[] s; f)
> >     {
> >         if (samples.length < k)
> >             samples ~= Sample(s.idup, n+1);
> >         else if (uniform(gen, 0., 1.) < cast(double) k / (n+1))
> >             samples[uniform(gen, 0, k)] = Sample(s.idup, n+1);
> >     }
> > 
> >     foreach (sample; samples)
> >         writefln("%8s: %s", sample.line, sample.text);
> > }
> > 
> > Usage: k-sample <file> <k>
> > 
> > It's easy to prove that a probability of any particular line appearing 
> > in the samples is k/n where n is the total number of lines in the text.
> 
> Sweet! Suggestion: replace Sample's definition with:
> 
> alias Tuple!(string, "text", ulong, "line") Sample;

Good advice, thanks.  The Sample definition is unimportant for the 
solution but eats up a lot of visual space.

> On a different topic, your use of std.stream reminded me. I was thinking 
> of yanking it from Phobos. Thoughts?

While experimenting with Bearophile's loader
> http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D.announce&article_id=13402
I've found out that BufferedFile is the fastest available way to process 
a file line-by-line.  So the alternative should be as fast.



More information about the Digitalmars-d mailing list