Random sampling in Phobos

Joseph Rushton Wakeling joseph.wakeling at webdrake.net
Sat Apr 14 05:39:03 PDT 2012


Hello all,

This is a follow-up to a discussion on the d-learn mailing list:
http://forum.dlang.org/thread/mailman.1652.1334241994.4860.digitalmars-d-learn@puremagic.com

In short, I've been implementing some random sampling algorithms in D to help 
teach myself how to use the language effectively:
https://github.com/WebDrake/SampleD

This is an adaptation of some code I wrote in C a couple of years ago to help 
deal with simulations where I had to take a very large random sample from an 
even larger total[1].  The question came up whether this might be useful for the 
random sampling tools provided in Phobos.

Phobos' std.random currently contains a RandomSample struct and a randomSample 
wrapper function which output a random subsample of the input data.  I see two 
problems with the current implementation.

   (1) The algorithm used is highly inefficient.  As far as I can see, the code
       implements Algorithm S as described in vol. 2 of Knuth's "The Art of
       Computer Programming" (pp. 142-143).  This algorithm goes sequentially
       over every element of the input, deciding one by one whether to accept
       or reject it.  Consequently it runs in O(N) time (where N is the data
       size) and requires O(N) random variates to be generated.  (A minimal
       sketch of this approach follows the list below.)

       Algorithms developed by Jeffrey Scott Vitter in the 1980s reduce this
       to O(n), where n is the sample size.  These are the algorithms I've
       implemented.

   (2) RandomSample requires the whole of the data (and the whole sample) to
       be loaded into memory while you're doing the sampling.  Great if you
       can do it, but it's sometimes prohibitive[1].  Often you don't _need_
       the data itself to be loaded -- just the index values of the selected
       records -- and you may not even need to store those as a collection.

       Example: you have a very big database and you want to estimate some
       statistics, so you try to take a representative but tractable
       subsample of the data.  Your code could go something like this:

            foreach (_; 0 .. sampleSize)
            {
                auto i = selectNextSamplePointIndex();
                auto data = loadDBRecord(i);
                // now add data to the statistical aggregators
                ...
            }
           // and at the end of the loop you have your aggregate statistics but
           // you haven't needed to store either the index or data values of the
           // selected sample points.
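
To make point (1) concrete, here's a minimal sketch of how Algorithm S works 
(the function name and signature here are mine for illustration, not Phobos' 
actual internals):

    import std.random;

    // Select n of the indices 0 .. N-1.  Every record is visited once
    // and accepted with probability (still needed) / (still unseen).
    size_t[] algorithmS(size_t n, size_t N, ref Random gen)
    {
        size_t[] sample;
        size_t needed = n;
        foreach (i; 0 .. N)     // one pass over ALL N records ...
        {
            // ... with one random variate drawn per record examined
            if (uniform(0.0, 1.0, gen) * (N - i) < needed)
            {
                sample ~= i;
                if (--needed == 0)
                    break;
            }
        }
        return sample;
    }

Each record examined costs one variate and one comparison, which is exactly 
the O(N) behaviour described in point (1).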

So, I would suggest two possibilities for improvement: one minor, one more 
extensive.

The former is simply to rewrite RandomSample to use Jeffrey Vitter's Algorithm D 
(nice coincidence :-).  This should be fairly easy, and I'm happy to prepare 
patches for it.
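
To illustrate the idea, here's a hypothetical sketch of the skip-based 
approach in the form of Vitter's simpler Algorithm A: rather than testing 
every record, we draw the size of the gap before the next selected record 
directly.  Algorithm A still finds the gap by linear search, so it saves 
variates but not time; Algorithm D replaces the search with a rejection step 
so that each gap costs O(1):

    import std.random;

    // Size of the gap before the next selected record, when n records
    // remain to be chosen out of N not yet examined.
    size_t skip(size_t n, size_t N, ref Random gen)
    {
        immutable V = uniform(0.0, 1.0, gen);
        size_t S = 0;
        double top = N - n;          // records that may be skipped
        double total = N;
        double quot = top / total;   // invariant: quot = P(gap > S)
        while (quot > V)
        {
            ++S;
            --top;
            --total;
            quot *= top / total;
        }
        return S;
    }

Each call consumes exactly one random variate, so a full sample needs only n 
variates; Algorithm D keeps that property while also making each call O(1).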

The latter would be to separate out the generation of sample index values into a 
new class that needs to know only the total number of data points and the 
required sample size.  RandomSample could then make use of this class (in 
practice there could be several implementations to choose from), but it would 
also be directly available to users who want to carry out sampling without the 
data being loaded into memory.
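
As a hypothetical sketch of what that separation might look like (all names 
here are mine for illustration, not those of the actual SampleD code), the 
index generator needs nothing beyond the two counts and a random number 
generator:

    import std.random;

    // Yields selected indices one at a time, in increasing order;
    // nothing about the data itself is stored.  skip() is the helper
    // sketched above.
    struct IndexSampler
    {
        private size_t n;     // records still to be selected
        private size_t N;     // records not yet examined
        private size_t pos;   // index of the next unexamined record
        private Random gen;

        this(size_t sampleSize, size_t total)
        {
            n = sampleSize;
            N = total;
            gen = Random(unpredictableSeed);
        }

        // Index of the next selected record; valid to call n times.
        size_t select()
        {
            assert(n > 0, "sample already complete");
            immutable S = skip(n, N, gen);
            immutable chosen = pos + S;
            pos += S + 1;
            N -= S + 1;
            --n;
            return chosen;
        }
    }

With something like this, the database example above reduces to

    auto sampler = IndexSampler(sampleSize, dbSize);
    foreach (_; 0 .. sampleSize)
    {
        auto data = loadDBRecord(sampler.select());
        // add data to the statistical aggregators
    }

and RandomSample itself could be a thin wrapper that uses the same generator 
to pop its way through the input range.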

My code as it stands provides the latter, but as a novice D user I'm not sure 
that it comes up to the standard required for the standard library; there are 
also improvements I'd like to make which I'm not sure how best to achieve and 
will need advice on.  I'm happy to work on and improve it to the extent needed, 
but I may need a fair amount of hand-holding along the way (the people on the 
d-learn list have already been excellent).

The current code is on GitHub at https://github.com/WebDrake/SampleD and 
implements the sampling algorithm classes together with some testing/demo code. 
To see the demonstration, just compile and run the code[2].

Anyway, what are everyone's thoughts on this proposal?

Thanks and best wishes,

     -- Joe


-- Footnote --

[1] I was simulating a sparse user-object network such as one might find in an 
online store.  This was a subset of the complete bipartite graph between U users 
and O objects.  Storing that complete graph in program memory would have become 
prohibitive for large values of U and O, e.g. for U, O = 10_000.

[2] As written, it takes about 2 minutes to run on my system when compiled with 
GDC (gdmd -O -release -inline), and about 5 minutes when compiled with dmd and 
the same flags.  If the runtime turns out to be prohibitive, I suggest reducing 
the value of trialRepeats on line 273.

