Ranges and random numbers -- again

Mon Jun 17 12:32:14 PDT 2013

Hello all,

I think my last post on this topic may have been too convoluted, so here's
another go.  The goal here is to try and derive some well-defined strategies for
designing ranges that make use of random number generators.

The purpose of this first email is to justify what I think should be the first
basic rule of random ranges [that is, ranges where popFront() makes use of
random or pseudo-random numbers to generate the new value of front()].  That
rule is:

    **************************************************************************
    * Reading multiple times from the start of the same random range, should *
    * produce different (and statistically independent) results each time.   *
    **************************************************************************

To see why, let's start by considering a simple input range that provides an
infinite stream of random numbers in the range [min, max).

    struct SimpleRandomRange(RNG = void)
        if(isUniformRNG!RNG || is(RNG == void))
    {
        private real _min, _max, _u;

        static if (!is(RNG == void))
        {
            private RNG _gen;

            this(real min, real max, RNG gen)
            {
                _min = min;
                _max = max;
                _gen = gen;
                _u = uniform(_min, _max, _gen);
            }
        }
        else
        {
            this(real min, real max)
            {
                _min = min;
                _max = max;
                _u = uniform(_min, _max);
            }
        }

        immutable bool empty = false;

        real front() @property pure nothrow const
        {
            return _u;
        }

        void popFront()
        {
            static if (is(RNG == void))
                _u = uniform(_min, _max);
            else
                _u = uniform(_min, _max, _gen);
        }
    }

    auto simpleRandomRange()(real min, real max)
    {
        return SimpleRandomRange!void(min, max);
    }

    auto simpleRandomRange(RNG)(real min, real max, RNG gen)
        if (isUniformRNG!RNG)
    {
        return SimpleRandomRange!RNG(min, max, gen);
    }

Let's consider the basics of how this works.  If we don't specify a generator,

    auto r1 = simpleRandomRange(0.0, 1.0);

... then the call to uniform() will make use of the thread-local default random
number generator, rndGen.  If we do specify an RNG:

    Random gen;
    gen.seed(unpredictableSeed);
    auto r2 = simpleRandomRange(0.0, 1.0, gen);

... then SimpleRandomRange will store a copy of it to use when generating the
random numbers.

So, what happens when we try and read numbers from this?  Let's try and generate
some output from this range:

    auto r1 = simpleRandomRange(0.0L, 1.0L);

    auto take1 = r1.take(5);
    auto take2 = r1.take(5);

    writeln(take1);
    writeln(take1);
    writeln(take2);
    writeln(take2);

... which might give us for example:

    [0.816207, 0.40483, 0.638285, 0.00737022, 0.467244]
    [0.816207, 0.902301, 0.0961124, 0.611573, 0.579579]
    [0.816207, 0.788726, 0.614613, 0.613375, 0.136722]
    [0.816207, 0.0942176, 0.894744, 0.751959, 0.77816]

We notice immediately that these sequences are all different, except for the
very first number, which is always identical.  What's more, it's identical for
each of the two 'takes'.

What's going on here?  Well, the first value is being set in the constructor;
subsequent values are being set by calls to the global RNG rndGen, and because
its state is being updated, that affects all subsequent values generated.

So, it means that

    (1) Different 'takes' from the same range will be different;

    (2) Reading twice from the same 'take' will also generate a different
        sequence.

That in itself is not so bad -- but the always-the-same first value is a
problem, as it will bias any statistics we generate.

What happens instead if we specify a generator?

    Random gen;
    gen.seed(unpredictableSeed);
    auto r2 = simpleRandomRange(0.0L, 1.0L, gen);

    auto take3 = r2.take(5);
    writeln(take3);
    writeln(take3);

... then we may get for example:

    [0.833161, 0.11342, 0.872063, 0.661881, 0.039937]
    [0.833161, 0.11342, 0.872063, 0.661881, 0.039937]

i.e., two identical sequences.  Superficially this looks nice -- a random
sequence that is consistent within the same 'take'.  But now let's take another
5 numbers:

    auto take4 = r2.take(5);
    writeln(take4);
    writeln(take4);

... and we get again:

    [0.833161, 0.11342, 0.872063, 0.661881, 0.039937]
    [0.833161, 0.11342, 0.872063, 0.661881, 0.039937]

Why is this the same?  When we pass gen to the constructor of SimpleRandomRange,
it copies it by value -- meaning that updates to the internally-stored _gen
don't propagate back to the source RNG.  This is extremely undesirable, because
it means that we can generate very strong correlations between what ought to be
independent random sequences.

How can we resolve this dilemma?  One thought would be to ensure that any RNG
stored inside SimpleRandomRange is seeded unpredictably.  Manually, without
changing SimpleRandomRange, we can do this by making sure that we do something like:

    auto r3 = simpleRandomRange(0.0L, 1.0L, Random(unpredictableSeed));
    auto take5 = r3.take(5);

    writeln(take5);
    writeln(take5);

    auto take6 = r3.take(5);
    writeln(take6);
    writeln(take6);

... but in fact this again produces identical output all round.

The other alternative is for random number generators to be redefined as
reference types.  A simple experiment is to redefine MersenneTwisterEngine as a
final class rather than a struct [N.B. this is probably _not_ the optimal way to
turn RNGs into reference types, but is quick and easy to try out!]:

    auto gen2 = new MtClass19937(unpredictableSeed);
    auto r5 = simpleRandomRange(0.0L, 1.0L, gen2);
    auto take9 = r5.take(5);
    auto take10 = r5.take(5);

    writeln(take9);
    writeln(take9);
    writeln(take10);
    writeln(take10);

... which gives us output similar to the original example using rndGen:

    [0.844254, 0.136284, 0.0805684, 0.351376, 0.830413]
    [0.844254, 0.863786, 0.983373, 0.389967, 0.257907]
    [0.844254, 0.825822, 0.176731, 0.397709, 0.470349]
    [0.844254, 0.349027, 0.118591, 0.707107, 0.893357]

i.e., an identical first value, but otherwise different.

One might object that the previous example -- the one where simpleRandomRange
was passed Random(unpredictableSeed) -- offers a superior outcome.  After all,
it generates a reproducible pseudo-random sequence that (theoretically)
shouldn't generate correlations with other (pseudo-)random activity in the
program.  The fact that two different .take commands generate identical output
is a feature, not a problem -- each time you're reading from the start of the
range, so you'd expect to get the same values!

To this I'd offer three responses.  The first is that enforcing an unpredictable
seed for the random range (in effect, making it operate like an independent
thread in terms of its internal RNG) makes things difficult to debug as, for
debugging purposes, you want to be able to have predictable seeds for streams of
pseudo-random numbers.  The second is that it seems to me that having a
multiplicity of separate RNGs risks statistical reliability: it's difficult to
be certain that two different "unpredictable" seeds won't in fact produce
correlated sequences, and becomes more difficult if you have many different
separately seeded RNGs.

Finally, note that even if you could set aside these issues, the reproducibility
of the sequence rests on the fact that your RNG is a pseudo-random number
generator.  If instead you pass SimpleRandomRange an RNG that reads from a true
source of randomness (say, /dev/random) it will be impossible to generate
predictable sequences.

Hence the rule which I posited at the beginning of this email, and which I'll
restate here, which seems to me the only way to ensure consistent behaviour of
these entities regardless of the source of randomness used:

    **************************************************************************
    * Reading multiple times from the start of the same random range, should *
    * produce different (and statistically independent) results each time.   *
    **************************************************************************

As a corollary, this means that pseudo-random number generators (in Phobos and
elsewhere) should be redefined as reference types, which has already been
pointed out in D bugzilla:
http://d.puremagic.com/issues/show_bug.cgi?id=7067

However, that's not the end of the story.  Remember that first value from the
range being always the same?  This also needs dealing with, and (I think) is
non-trivial.  I will discuss this in the next email.

In the meantime -- I'm interested in feedback on the proposed rule. :-)

Thanks and best wishes,

    -- Joe