Sampling algorithms for D

Thu Apr 12 07:45:34 PDT 2012

Hello all,

My programming experience is mostly in C with some C++, so much of the 
programming style of D is not familiar to me.  With that in mind, I thought I'd 
set up a little pet project to try and learn good D programming style.

What I thought I'd do is implement some clever algorithms for random sampling 
which I've already done in a C based on the GNU Scientific Library.

The D implementation is (in very primitive form) at:
https://github.com/WebDrake/SampleD

... and I'm using GDC 4.6.3 on Ubuntu 12.04 as my compiler.

The basic challenge here is that, given a collection of N records, we want to 
take a sample of size n < N such that the chance of any given record being 
picked is equal.  The algorithms implemented are based on the notion of 
generating a series of random "skips" across the sequence of records, picking 
the records you land on.

The idea is that one initializes the algorithm by giving the total number of 
records and the sample size required.  The sampler provides a method, select(), 
which when called picks the next entry in the sample, up to the total number 
required; if you keep calling it when the sample is complete, it throws an error.

Consequently, these algorithms don't store the entire sample, or the collection 
of records (which are just assumed to be labelled 0, 1, 2, ..., N-1).  Rather, 
it just outputs, one by one as called upon, the indices of the selected records.

So, introduction away, here are the main questions I came up with in creating 
the code.

(1) Effective handling of random numbers.
----
The design concept was for the sampler classes to use a specified uniform random 
number generator if provided, otherwise to use rndGen.  This turned out to be 
trickier than anticipated, and I'm not very happy with the current solution.

My first attempt just templatized class methods select() and skip().  However, 
one of the algorithms (class VitterD) requires random initialization as well, 
and trying to template this() ran into this bug: 
http://d.puremagic.com/issues/show_bug.cgi?id=435

I'm really not too happy with turning the whole class into a template just for 
the sake of the random number generator, so can anyone suggest an effective way 
to resolve the problem?

(I may be biased here by C/GSL experience, where I've got used to passing the 
RNG as a pointer, meaning runtime polymorphism.)

(2) init() function?
----
I was unsure whether it would be good design practice for the sampler classes to 
be throwaway (1 instance = 1 use) or to put in place an init() function that 
would reset it with new record and sample sizes.  Any thoughts?

(3) Uniform random number on (0,1)
----
The algorithms' specification explicitly refers to uniform random numbers on the 
open interval, which I take to mean (0,1) i.e. excluding zero.  Phobos currently 
provides only a half-open uniform distribution on [a,b).

I've implemented a simple uniform_open() function which should return a uniform 
distribution on (a,b).  The question is, are there any plans for an 
open-interval uniform distribution in Phobos?  Can I rely on this functionality 
being provided in the long term in D's standard library?

(4) Truncation
----
The code uses C's trunc() function to return the integer part of a floating 
point number.  D doesn't seem to like it very much when I then set a size_t as 
equal to the result of trunc(), and I have to use a type cast.  Is there a more 
elegant D-ish way of achieving the same result?  roundTo() isn't adequate, as it 
rounds to the _nearest_ integer rather than the integer part.

(5) ... any general stylistic comments? :-)

Thanks and best wishes,

     -- Joe