Random sampling next steps

Mon Jul 23 07:31:16 PDT 2012

Hello all,

At the risk of sounding like a one-trick pony, some thoughts on RandomSample and 
what remains to be done with it.

Besides algorithmic improvements, the recent updates from Martin Nowak and 
myself have fixed a couple of key bugs:
http://d.puremagic.com/issues/show_bug.cgi?id=7936
http://d.puremagic.com/issues/show_bug.cgi?id=8314

... but we still have a couple of important outstanding issues:
http://d.puremagic.com/issues/show_bug.cgi?id=7067
http://d.puremagic.com/issues/show_bug.cgi?id=8247

The solution here would seem to be identical and involve not changing 
RandomSample but changing the random number generators to being reference types; 
cf. discussion in those bug reports, and also here:
http://forum.dlang.org/thread/4FD735EB.70404@webdrake.net

This is a controversial issue as it would involve a breaking change to D's RNG 
design, but broadly speaking seems like a necessary breaking change to avoid 
these kinds of issues cropping up consistently across various different 
applications.  Indeed, it seems likely that only flawed code, involving passing 
RNG by value, would have an issue with this change, and there may also be 
positive effects for future development of other parts of std.random.

However, the issue I'd like to dwell on here is the one raised by Craig 
Dillabaugh in comments on my blog.  The current RandomSample implementation 
covers only the case where the total number of items to pick from is known in 
advance.  In other words, either your input range needs hasLength!R == true or 
you need to manually specify the total number of items when calling randomSample:

     auto randomSample(R)(R r, size_t n, size_t total)
     if(isInputRange!R)
     {
         return RandomSample!(R, void)(r, n, total);
     }

     auto randomSample(R)(R r, size_t n)
     if(isInputRange!R && hasLength!R)
     {
         return RandomSample!(R, void)(r, n, r.length);
     }

... and similarly in the constructor of RandomSample:

     static if (hasLength!R)
         this(R input, size_t howMany)
         {
             this(input, howMany, input.length);
         }

     this(R input, size_t howMany, size_t total)
     {
         _input = input;
         _available = total;
         _toSelect = howMany;
         enforce(_toSelect <= _available);
         _first = true;
     }

But what if the total number of items is not known in advance?  E.g. if you are 
reading a file, line by line, or reading records from a tape; you may know the 
total is finite, but you don't know what it actually _is_.

There are dedicated algorithms for this, via the technique known as "reservoir 
sampling".  Knuth lists Algorithm R; Vitter has defined an improved technique, 
Algorithm Z.  These should be fairly easy to implement in D, but the question is 
how to integrate them with the existing random sampling functionality.

The simplest way would be to just implement an additional struct and function, 
say, RandomReservoirSample and randomReservoirSample.  However, I was wondering 
about a setup where there is a single interface that looks at the input range 
and other data you have provided and makes an intelligent choice of the optimal 
algorithm.  It's this I'd like advice on.

One solution would be to have two different underlying structs which are 
selected depending on provided data:

     auto randomSample(R)(R r, size_t n)
     if(isInputRange!R && hasLength!R)
     {
         return RandomSample!(R, void)(r, n, r.length);
     }

     auto randomSample(R)(R r, size_t n)
     if(isInputRange!R && !hasLength!R)
     {
         return RandomReservoirSample!(R, void)(r, n);
     }

... but doing something similar within RandomSample doesn't seem so easy.  Why? 
  Because the static if()s that you'd require within the struct would not depend 
just on whether hasLength!R == true, but also on whether you'd passed a size_t 
total to the constructor.  I'm not sure how to factor in the latter case, even 
though it's clearly a compile-time issue.  Can anyone advise?

Basically, I want to be able to say: static if(!hasLength!R && 
no_total_was_passed) then certain functions or certain variables are or aren't 
defined.

I also think it would be a good idea for the reservoir sampling technique to 
emit a warning when in debug mode, to prompt the user to be _sure_ that they 
can't specify the total number of points to sample from.  Is there a recommended 
method of doing something like this?

Alternatively, would people prefer to entirely separate the known-total and 
unknown-total sampling methods entirely, so the choice is always manual?

Finally, if hasLength!R == false, is there any way of guaranteeing that the 
input range is still going to be ultimately finite?  There could be some very 
nasty worst-case behaviour in the case of infinite ranges.

I'm asking this here rather than d-learn as it's related to design choices for 
Phobos, but I'll happily move the discussion over if necessary.

Thanks and best wishes,

     -- Joe