random k-sample of a file

Michel Fortin michel.fortin at michelf.com
Fri Oct 10 04:50:17 PDT 2008


On 2008-10-09 14:56:48 -0400, Andrei Alexandrescu 
<SeeWebsiteForEmail at erdani.org> said:

> I just ran across a nice little problem that I wanted to share as an 
> exercise for the interested in futile pastimes.
> 
> The challenge, should you accept it, is to write a program that given a 
> number k and a file, outputs k lines picked uniformly at random from 
> the file. (If the file has less than k lines, it should all be output.) 
> The file's size is unbounded so loading it in memory is not an option. 
> How'd you go about it?

My attempt. Works in one pass, and distributes uniformly but only when 
k is a factor of the number of lines in the file. At least, I believe 
the distribution is uniform, but feel free to prove me wrong.

/* Uniform random value r, with r >= min && r < max. */
real random(uint min, uint max) { ... }

/* Shuffle elements uniformly in the given string array. */
void shuffle(ref string[] array) { ... }

string[] challenge(alias R)(R input, uint k)
{
	string[] result;
	uint line;
	real pickFactor = 1;

	// Initial stage: fill initial k strings
	while (!input.eof && line < k)
	{
		result ~= input.readLine;
		++line;
	}
	shuffle(result);

	while (!input.eof)
	{
		// Next stage: half the chances of picking a candidate.
		pickFactor /= 2;
		string[] candidates;
		line = 0;

		while (!input.eof && line < k)
		{
			candidates ~= input.readLine;
			++line;
		}
		shuffle(candidates);

		for (uint i; i < candidates.length; ++i)
			if (random(0, 1) < pickFactor)
				result[i] = candidates[i];
		// Note: if candidate.length != result.length
		// we don't have uniformity.
	}
	return result;
}

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/




More information about the Digitalmars-d mailing list