random k-sample of a file
Andrei Alexandrescu
SeeWebsiteForEmail at erdani.org
Thu Oct 9 12:30:30 PDT 2008
dsimcha wrote:
> == Quote from Andrei Alexandrescu (SeeWebsiteForEmail at erdani.org)'s article
>> I just ran across a nice little problem that I wanted to share as an
>> exercise for the interested in futile pastimes.
>> The challenge, should you accept it, is to write a program that given a
>> number k and a file, outputs k lines picked uniformly at random from the
>> file. (If the file has less than k lines, it should all be output.) The
>> file's size is unbounded so loading it in memory is not an option. How'd
>> you go about it?
>> Andrei
>
> Pretty simple, actually. Read the file one character at a time. Get the offset
> of the beginning of each line as a ulong or something. Store this information in
> another file essentially as an on-disk array, where, for example, the first
> ulong.sizeof bytes represent the bit pattern for the offset of the first line, the
> next ulong.sizeof bytes represent the bit pattern for the offset of the second
> line, etc. Use plenty of casting tricks.
So this is essentially an index of the file's lines?
> Shuffle the ulong.sizeof chunks of this
> file the same way you would shuffle an ordinary array, using seeking as the
> equivalent to array indexing.
So this would shuffle that index.
> (I never said this was going to be efficient.) For
> each of the first k offsets after shuffling, print everything in the original file
> from that offset to the first EOL character encountered.
Can it be done faster and maybe in one pass?
Andrei
More information about the Digitalmars-d
mailing list