random k-sample of a file

Thu Oct 9 15:09:41 PDT 2008

Reply to Jason,

> BCS Wrote:
> 
>> Reply to Andrei,
>> 
>>> I just ran across a nice little problem that I wanted to share as an
>>> exercise for the interested in futile pastimes.
>>> 
>>> The challenge, should you accept it, is to write a program that
>>> given a number k and a file, outputs k lines picked uniformly at
>>> random from the file. (If the file has less than k lines, it should
>>> all be output.) The file's size is unbounded so loading it in memory
>>> is not an option. How'd you go about it?
>>> 
>>> Andrei
>>> 
>> struct S
>> {
>> char[] line;
>> real v;
>> int opCmp(S s)
>> {
>> if(s.v < this.v) return -1;
>> else if(s.v > this.v) return +1;
>> return 0;
>> }
>> }
>> int k;
>> S[] buf = new S[k+1];
>> foreach(s;buf) s.v = real.max;
>> 
>> foreach(char[] line; stream)
>> {
>> delete buf[$-1].line;
>> buf[$-1].line = line.dup;;
>> buf[$-1].v = reaqlRand();  // returns v > real.min and v < real.max
>> buf.sort;
>> }
>> foreach(s;buf) if(s.v != real.max; writef("%s\n", s.line);
>> 
> You should check if your new random number is good enough to keep
> before doing the array manipulation and string dup... That small
> change should yield an near-optimal big O. A heap should give slightly
> faster performance for moderate values of k.
> 

I was going for simplicity and ease of reading within a given big-O class.

Switching to insertion sort or reusing data buffers would also do better.