random k-sample of a file

Thu Oct 9 12:30:30 PDT 2008

dsimcha wrote:
> == Quote from Andrei Alexandrescu (SeeWebsiteForEmail at erdani.org)'s article
>> I just ran across a nice little problem that I wanted to share as an
>> exercise for the interested in futile pastimes.
>> The challenge, should you accept it, is to write a program that given a
>> number k and a file, outputs k lines picked uniformly at random from the
>> file. (If the file has less than k lines, it should all be output.) The
>> file's size is unbounded so loading it in memory is not an option. How'd
>> you go about it?
>> Andrei
> 
> Pretty simple, actually.  Read the file one character at a time.  Get the offset
> of the beginning of each line as a ulong or something.  Store this information in
> another file essentially as an on-disk array, where, for example, the first
> ulong.sizeof bytes represent the bit pattern for the offset of the first line, the
> next ulong.sizeof bytes represent the bit pattern for the offset of the second
> line, etc.  Use plenty of casting tricks.

So this is essentially an index of the file's lines?

> Shuffle the ulong.sizeof chunks of this
> file the same way you would shuffle an ordinary array, using seeking as the
> equivalent to array indexing.

So this would shuffle that index.

> (I never said this was going to be efficient.)  For
> each of the first k offsets after shuffling, print everything in the original file
> from that offset to the first EOL character encountered.

Can it be done faster and maybe in one pass?

Andrei