random k-sample of a file

dsimcha dsimcha at yahoo.com
Thu Oct 9 12:09:03 PDT 2008


== Quote from Andrei Alexandrescu (SeeWebsiteForEmail at erdani.org)'s article
> I just ran across a nice little problem that I wanted to share as an
> exercise for the interested in futile pastimes.
> The challenge, should you accept it, is to write a program that given a
> number k and a file, outputs k lines picked uniformly at random from the
> file. (If the file has less than k lines, it should all be output.) The
> file's size is unbounded so loading it in memory is not an option. How'd
> you go about it?
> Andrei

Pretty simple, actually.  Read the file one character at a time.  Get the offset
of the beginning of each line as a ulong or something.  Store this information in
another file essentially as an on-disk array, where, for example, the first
ulong.sizeof bytes represent the bit pattern for the offset of the first line, the
next ulong.sizeof bytes represent the bit pattern for the offset of the second
line, etc.  Use plenty of casting tricks.  Shuffle the ulong.sizeof chunks of this
file the same way you would shuffle an ordinary array, using seeking as the
equivalent to array indexing.  (I never said this was going to be efficient.)  For
each of the first k offsets after shuffling, print everything in the original file
from that offset to the first EOL character encountered.




More information about the Digitalmars-d mailing list