Random sampling next steps
Joseph Rushton Wakeling
joseph.wakeling at webdrake.net
Mon Jul 23 07:31:16 PDT 2012
Hello all,
At the risk of sounding like a one-trick pony, some thoughts on RandomSample and
what remains to be done with it.
Besides algorithmic improvements, the recent updates from Martin Nowak and
myself have fixed a couple of key bugs:
http://d.puremagic.com/issues/show_bug.cgi?id=7936
http://d.puremagic.com/issues/show_bug.cgi?id=8314
... but we still have a couple of important outstanding issues:
http://d.puremagic.com/issues/show_bug.cgi?id=7067
http://d.puremagic.com/issues/show_bug.cgi?id=8247
The solution here would seem to be identical and involve not changing
RandomSample but changing the random number generators to being reference types;
cf. discussion in those bug reports, and also here:
http://forum.dlang.org/thread/4FD735EB.70404@webdrake.net
This is a controversial issue as it would involve a breaking change to D's RNG
design, but broadly speaking seems like a necessary breaking change to avoid
these kinds of issues cropping up consistently across various different
applications. Indeed, it seems likely that only flawed code, involving passing
RNG by value, would have an issue with this change, and there may also be
positive effects for future development of other parts of std.random.
However, the issue I'd like to dwell on here is the one raised by Craig
Dillabaugh in comments on my blog. The current RandomSample implementation
covers only the case where the total number of items to pick from is known in
advance. In other words, either your input range needs hasLength!R == true or
you need to manually specify the total number of items when calling randomSample:
auto randomSample(R)(R r, size_t n, size_t total)
if(isInputRange!R)
{
return RandomSample!(R, void)(r, n, total);
}
auto randomSample(R)(R r, size_t n)
if(isInputRange!R && hasLength!R)
{
return RandomSample!(R, void)(r, n, r.length);
}
... and similarly in the constructor of RandomSample:
static if (hasLength!R)
this(R input, size_t howMany)
{
this(input, howMany, input.length);
}
this(R input, size_t howMany, size_t total)
{
_input = input;
_available = total;
_toSelect = howMany;
enforce(_toSelect <= _available);
_first = true;
}
But what if the total number of items is not known in advance? E.g. if you are
reading a file, line by line, or reading records from a tape; you may know the
total is finite, but you don't know what it actually _is_.
There are dedicated algorithms for this, via the technique known as "reservoir
sampling". Knuth lists Algorithm R; Vitter has defined an improved technique,
Algorithm Z. These should be fairly easy to implement in D, but the question is
how to integrate them with the existing random sampling functionality.
The simplest way would be to just implement an additional struct and function,
say, RandomReservoirSample and randomReservoirSample. However, I was wondering
about a setup where there is a single interface that looks at the input range
and other data you have provided and makes an intelligent choice of the optimal
algorithm. It's this I'd like advice on.
One solution would be to have two different underlying structs which are
selected depending on provided data:
auto randomSample(R)(R r, size_t n)
if(isInputRange!R && hasLength!R)
{
return RandomSample!(R, void)(r, n, r.length);
}
auto randomSample(R)(R r, size_t n)
if(isInputRange!R && !hasLength!R)
{
return RandomReservoirSample!(R, void)(r, n);
}
... but doing something similar within RandomSample doesn't seem so easy. Why?
Because the static if()s that you'd require within the struct would not depend
just on whether hasLength!R == true, but also on whether you'd passed a size_t
total to the constructor. I'm not sure how to factor in the latter case, even
though it's clearly a compile-time issue. Can anyone advise?
Basically, I want to be able to say: static if(!hasLength!R &&
no_total_was_passed) then certain functions or certain variables are or aren't
defined.
I also think it would be a good idea for the reservoir sampling technique to
emit a warning when in debug mode, to prompt the user to be _sure_ that they
can't specify the total number of points to sample from. Is there a recommended
method of doing something like this?
Alternatively, would people prefer to entirely separate the known-total and
unknown-total sampling methods entirely, so the choice is always manual?
Finally, if hasLength!R == false, is there any way of guaranteeing that the
input range is still going to be ultimately finite? There could be some very
nasty worst-case behaviour in the case of infinite ranges.
I'm asking this here rather than d-learn as it's related to design choices for
Phobos, but I'll happily move the discussion over if necessary.
Thanks and best wishes,
-- Joe
More information about the Digitalmars-d
mailing list