random k-sample of a file
Michel Fortin
michel.fortin at michelf.com
Fri Oct 10 04:50:17 PDT 2008
On 2008-10-09 14:56:48 -0400, Andrei Alexandrescu
<SeeWebsiteForEmail at erdani.org> said:
> I just ran across a nice little problem that I wanted to share as an
> exercise for the interested in futile pastimes.
>
> The challenge, should you accept it, is to write a program that given a
> number k and a file, outputs k lines picked uniformly at random from
> the file. (If the file has less than k lines, it should all be output.)
> The file's size is unbounded so loading it in memory is not an option.
> How'd you go about it?
My attempt. Works in one pass, and distributes uniformly but only when
k is a factor of the number of lines in the file. At least, I believe
the distribution is uniform, but feel free to prove me wrong.
/* Uniform random value r, with r >= min && r < max. */
real random(uint min, uint max) { ... }
/* Shuffle elements uniformly in the given string array. */
void shuffle(ref string[] array) { ... }
string[] challenge(alias R)(R input, uint k)
{
string[] result;
uint line;
real pickFactor = 1;
// Initial stage: fill initial k strings
while (!input.eof && line < k)
{
result ~= input.readLine;
++line;
}
shuffle(result);
while (!input.eof)
{
// Next stage: half the chances of picking a candidate.
pickFactor /= 2;
string[] candidates;
line = 0;
while (!input.eof && line < k)
{
candidates ~= input.readLine;
++line;
}
shuffle(candidates);
for (uint i; i < candidates.length; ++i)
if (random(0, 1) < pickFactor)
result[i] = candidates[i];
// Note: if candidate.length != result.length
// we don't have uniformity.
}
return result;
}
--
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/
More information about the Digitalmars-d
mailing list