Random sampling in Phobos

Tue Apr 17 08:31:00 PDT 2012

On 4/14/12 7:39 AM, Joseph Rushton Wakeling wrote:
> In short, I've been implementing some random sampling algorithms in D to
> help teach myself how to use the language effectively:
> https://github.com/WebDrake/SampleD
>
> This is an adaptation of some code I wrote in C a couple of years ago to
> help deal with simulations where I had to take a very large random
> sample from an even larger total[1]. The question came up whether this
> might be useful for the random sampling tools provided in Phobos.

Absolutely. We could and should integrate Vitter's algorithm; the only 
reason it's not there yet is because I didn't have the time.

> Phobos' std.random currently contains a RandomSample struct and
> randomSample wrapper function which output a random subsample of the
> input data. I would identify 2 problems with the current implementation.
>
> (1) The algorithm used is highly inefficient. As far as I can see the code
> implements Algorithm S as described in vol. 2 of Knuth's "The Art of
> Computer Programming" (pp. 142-143). This algorithm sequentially goes
> over each of the possible data elements deciding whether to accept or
> reject it. Consequently, it's of o(N) complexity (where N is data size)
> and requires o(N) random variates to be generated.

That is correct. However the distinction is often academic because the 
cost of generating random numbers is lower than the cost of scanning the 
data, and skipping N steps in the data is comparable to the cost of 
skipping one step N times.

> Algorithms developed by Jeffrey Scott Vitter in the 1980s allow this to be
> reduced to o(n) where n is sample size. These are the algorithms I've
> implemented.
>
> (2) RandomSample requires the whole data (and whole sample) to be loaded
> into
> memory while you're doing the sampling. Great if you can do it, but it's
> sometimes prohibitive[1]. Often you don't _need_ the data to be loaded --
> just the index values of the selected data -- and you may not even need
> to store those as a collection.

Actually that's not correct. RandomSample works fine with an input range 
and does not keep it in memory.

> The former is just to rewrite RandomSample to use Jeffrey Vitter's
> Algorithm D (nice coincidence:-). This should be fairly easy and I'm
> happy to prepare patches for this.

This would be great, but you'd need to show with benchmarks that the 
proposed implementation does better than the extant one for most cases 
that matter.

Thanks,

Andrei