On random numbers and ranges

Mon Jun 10 08:48:50 PDT 2013

Hello all,

Anyone looking at D bugzilla or Phobos pull requests recently will see that I've
been filing a number of bug reports and fixes against std.random.RandomSample.

(I have Diggory to thank for this -- during the course of a discussion on
D.learn I noticed an error when attempting to sample file.byLine, and that in
turn led to some careful re-examination of the whole RandomSample struct.)

Now, the good news is that it's unlikely that the issues reported will have
bitten anyone using randomSample as intended.  However, the issues that have
arisen seem interesting to discuss in a broader context of random number
generation and its relation to ranges.

These are thoughts I've been mulling over for some time, and it seems like an
opportune moment to share them, with RandomSample providing a concrete
illustration.  There's also some concrete application: I am not sure how to
proceed on certain issues I've identified, and would like feedback and advice.

So.

Let's begin by defining the concept of a Random Range: that is, a range where
popFront() makes use of a source of randomness.  This can be a pseudo-random
number generator or a source of "true" randomness such as /dev/random.

This immediately leads to a situation different from other range objects.  For
example, it means that the concept of a "save()" method is different and maybe
meaningless: we can save the current state of the range, but the forward motion
of different saved copies will diverge because of the randomness in popFront().

(Caveat: in the event that the source of randomness is a pseudo-random number
generator, it's technically possible to also save the state of the RNG.
However, this has other issues that make it undesirable, related to statistical
"safety" within the program.  This has been discussed before on the list.)

That might ultimately be boring -- a conclusion that Random Ranges can't be
Forward Ranges.  What I find more compelling is something different: many Random
Ranges also have a unique challenge that the initial value of front() cannot be
determined until it is accessed for the first time.

To see why, consider the case of RandomSample and its corresponding bug 8314.
The original version of RandomSample determined the initial value of front() in
the constructor.  This meant that if you read from the same sample multiple
times, the first value would always be identical, destroying the statistical
independence of the samples.

What I realized recently is that this has knock-on effects elsewhere.  For
example, for popFront() to work correctly, the value of front() must be
initialized.  In the case of RandomSample, calling popFront() before front() can
result in a statistical anomaly -- sampling (n-1) items with even probability
from a range of length (N-1) -- whereas what it _should_ be is sampling (n-1)
items from the remains of the input range after the first item has been selected.

There may also be other methods or properties that depend on the initialization
of front() -- for example, RandomSample's index() method assumes that the first
sample point has already been chosen.

At the same time, there are methods one would definitely want _not_ to affect
the initialization of front().  For example, RandomSample has the .length
property.  If calling .length would in turn trigger initialization of front(),
you could have code like this:

    auto sample = randomSample(iota(100), 5);
    enforce(sample.length == 5);   // <--- if it initializes the value of
                                   // .front, it will fix the first sample point

    foreach(s; sample)
        writeln(s);
    writeln();

    foreach(s; sample)  // <--- and if front is initialized by the call to
        writeln(s);     //  .length, it means this second extraction of the
                        //  sample will have an identical first point to the
                        //  previous one, which destroys statistical
                        //  independence.  Cf. Issue 8314.

The simple fix here is for the programmer to recognize what methods require
initialization of front(), to check inside those methods for some boolean switch
that indicates initialization, and depending on that switch call some
initialization function.  In RandomSample we have:

    @property auto ref front()
    {
        assert(!empty);
        // The first sample point must be determined here to avoid
        // having it always correspond to the first element of the
        // input.  The rest of the sample points are determined each
        // time we call popFront().
        if(_first)
        {
            // We can save ourselves a random variate by checking right
            // at the beginning if we should use Algorithm A.
            if((_alphaInverse * _toSelect) > _available)
            {
                _algorithmA = true;
            }
            else
            {
                _Vprime = newVprime(_toSelect);
                _algorithmA = false;
            }
            prime();
            _first = false;
        }
        return _input.front;
    }

... and what is currently in the if(_first) { } scope could be moved to a
separate private frontInit() method to be called by any of the methods and
properties that require front initialization.

The problem with this approach is that it leaves it up to the programmer to be
competent and may in the process be a rich source of bugs in random ranges.
Other consequences include the fact that properties like front() can't actually
be const or pure.

Some while ago [
http://forum.dlang.org/post/mailman.1227.1344816494.31962.digitalmars-d@puremagic.com
] I speculated about the idea of a start() method that could be guaranteed to be
called immediately before the first access to front().  In practice it would
need to be more complex than that: in fact, the check that would need to be done
would have to be something like this:

      (1) What variables are set in the start() method?

      (2) What methods or properties make use of those variables?

      (3) Find the first call to any one (and only one) of those methods or
          properties, and call start() right before it.

One could also make it a requirement that any range whose popFront() calls on
random number generation either define a start() function or explicitly disable it.

However, that seems to be a very complicated thing to define and check, and I
wonder what others think about its feasibility.

Anyway, I think this email is long enough by now -- I'm curious what others
think about these issues and how they might be addressed.  In the short term I
will probably fix the issues with RandomSample as described here, but I am very
interested in ideas for more general solutions.

{Enj,Destr}oy? :-)

Best wishes,

    -- Joe