Sampling algorithms for D
Joseph Rushton Wakeling
joseph.wakeling at webdrake.net
Thu Apr 12 07:45:34 PDT 2012
Hello all,
My programming experience is mostly in C with some C++, so much of the
programming style of D is not familiar to me. With that in mind, I thought I'd
set up a little pet project to try and learn good D programming style.
What I thought I'd do is implement some clever algorithms for random sampling
which I've already done in a C based on the GNU Scientific Library.
The D implementation is (in very primitive form) at:
https://github.com/WebDrake/SampleD
... and I'm using GDC 4.6.3 on Ubuntu 12.04 as my compiler.
The basic challenge here is that, given a collection of N records, we want to
take a sample of size n < N such that the chance of any given record being
picked is equal. The algorithms implemented are based on the notion of
generating a series of random "skips" across the sequence of records, picking
the records you land on.
The idea is that one initializes the algorithm by giving the total number of
records and the sample size required. The sampler provides a method, select(),
which when called picks the next entry in the sample, up to the total number
required; if you keep calling it when the sample is complete, it throws an error.
Consequently, these algorithms don't store the entire sample, or the collection
of records (which are just assumed to be labelled 0, 1, 2, ..., N-1). Rather,
it just outputs, one by one as called upon, the indices of the selected records.
So, introduction away, here are the main questions I came up with in creating
the code.
(1) Effective handling of random numbers.
----
The design concept was for the sampler classes to use a specified uniform random
number generator if provided, otherwise to use rndGen. This turned out to be
trickier than anticipated, and I'm not very happy with the current solution.
My first attempt just templatized class methods select() and skip(). However,
one of the algorithms (class VitterD) requires random initialization as well,
and trying to template this() ran into this bug:
http://d.puremagic.com/issues/show_bug.cgi?id=435
I'm really not too happy with turning the whole class into a template just for
the sake of the random number generator, so can anyone suggest an effective way
to resolve the problem?
(I may be biased here by C/GSL experience, where I've got used to passing the
RNG as a pointer, meaning runtime polymorphism.)
(2) init() function?
----
I was unsure whether it would be good design practice for the sampler classes to
be throwaway (1 instance = 1 use) or to put in place an init() function that
would reset it with new record and sample sizes. Any thoughts?
(3) Uniform random number on (0,1)
----
The algorithms' specification explicitly refers to uniform random numbers on the
open interval, which I take to mean (0,1) i.e. excluding zero. Phobos currently
provides only a half-open uniform distribution on [a,b).
I've implemented a simple uniform_open() function which should return a uniform
distribution on (a,b). The question is, are there any plans for an
open-interval uniform distribution in Phobos? Can I rely on this functionality
being provided in the long term in D's standard library?
(4) Truncation
----
The code uses C's trunc() function to return the integer part of a floating
point number. D doesn't seem to like it very much when I then set a size_t as
equal to the result of trunc(), and I have to use a type cast. Is there a more
elegant D-ish way of achieving the same result? roundTo() isn't adequate, as it
rounds to the _nearest_ integer rather than the integer part.
(5) ... any general stylistic comments? :-)
Thanks and best wishes,
-- Joe
More information about the Digitalmars-d-learn
mailing list