Ranges and random numbers -- initializing .front and related values

Tue Jun 18 09:29:53 PDT 2013

On Tue, Jun 18, 2013 at 03:17:01PM +0100, Joseph Rushton Wakeling wrote:
> Hello all,
> 
> The previous discussion on random ranges -- that is, ranges where
> popFront() calls a random number generator (real or pseudo) -- seems
> to have come to a nice conclusion: namely, that the problems
> identified can be resolved by ensuring that random number generators
> operate as reference rather than value types.
> 
> So now, let's turn to a different set of problems that arise with
> random numbers and ranges.  In this case I don't have a concrete
> proposal or solution, but I'd like to present the problems and invite
> discussion.
> 
> Let's take as read that a conversion to reference types has been made
> (e.g. the attached Mersenne Twister variant).  Now consider the
> following random range, which generates a sequence of n random numbers
> u_1, u_2, ..., u_n, uniformly distributed in [0, 1) and also
> calculates the remainder of n - \sum_{i}u_i.
> 
>     struct Remainder(RNG)
>         if(isUniformRNG!RNG)
>     {
>         private real _remaining, _u;
>         private size_t _n;
>         private RNG _gen;
> 
>         this(size_t n, RNG gen)
>         {
>             assert(n > 0);
>             _n = n;
>             _remaining = n;
>             _gen = gen;
>             prime();       // sets value of _u.
>         }
[...]
>         private void prime()
>         {
>             _u = uniform(0.0L, 1.0L, _gen);
>             _remaining -= _u;
>         }
>     }
[...]
> So, let's initialize one of these ranges with a reference-type PRNG,
> and see what comes out if it.
> 
>     auto gen = new MtClass19937(unpredictableSeed);
>     auto r = remainder(5, gen);
>     writeln(r);
>     writeln(r);
> 
> ... which gives us for example:
> 
>     [0.686448, 0.206227, 0.331488, 0.558315, 0.255811]
>     [0.686448, 0.596189, 0.602239, 0.704929, 0.740741]
> 
> What do we see?  Well, the two different readings from the range are
> different -- as they should be (see previous discussion) -- but _not_
> statistically independent: the first value is identical.  This will be
> the case no matter how many times we consume the range, because the
> very first value of .front is determined in the constructor:
> 
>     this(size_t n, RNG gen)
>     {
>         assert(n > 0);
>         _n = n;
>         _remaining = n;
>         _gen = gen;
>         prime();    // <------ sets first value of _u;
>     }
> 
> Obviously this is an unacceptable violation of statistical
> independence.  The issue here is that the initial value of .front
> shouldn't actually be determined until we start consuming the range.
> So for example we might add a boolean variable to control this, and
> rewrite the constructor and front:
[...]

Actually, I question the test code:

     auto gen = new MtClass19937(unpredictableSeed);
     auto r = remainder(5, gen);
     writeln(r);
     writeln(r);

I think there may be some misunderstanding of the semantics of r here.
This may have been caused by the fact that arrays are "anomalous" ranges
(as Jonathan put it), in that you can iterate over them multiple times
and get the same results. However, generally, wrapper ranges are not
designed to be iterated more than once; once the range has been
consumed, no assumption can be made about it afterwards. That is to say,
unless the range-wrapping function returns a forward range, they can
only be iterated over once. The second writeln above is therefore
treading on dangerous ground; the correct approach is to decide whether
you want to iterate over the same range twice:

     auto gen = new MtClass19937(unpredictableSeed);
     auto r = remainder(5, gen);
     writeln(r.save);		// assuming r is a forward range
     writeln(r);

or, if you want two *different* sequences to be extracted from gen:

     auto gen = new MtClass19937(unpredictableSeed);
     writeln(remainder(5, gen));
     writeln(remainder(5, gen));	// call remainder() twice to
					// extract 5 elements from gen
					// twice

IOW, the range returned by remainder() is not to be treated as some kind
of "repeatable data source" in the sense that iterating it multiple
times will repeat the behaviour of taking 5 elements from gen and doing
stuff to them.  Instead, it should be treated as a single-use data
source, that gets consumed by the first writeln(), so the second
writeln() is, strictly speaking, invalid.  Technically, the second
writeln should just print an empty range; the reason it doesn't, is
because r is a value type.  Arguably, it should be a reference type, but
I suppose your typical range-wrapping function in std.algorithm /
std.range assumes that the user understands that the resulting range is
intended to be iterated only once, so they return value types (structs)
as an optimization (and also to avoid GC allocations).

But r being a value type is really a question of implementation; from an
abstract POV, the user shouldn't need to know whether it's a value type
or reference type, and thus should *not* assume one way or another. So
passing r to writeln twice is, strictly speaking, incorrect code. The
correct way to achieve what you intend in this case, is to call
remainder() twice, as in my second code snippet above.

This is applicable not only to remainder(), but to many of the wrapper
ranges returned by std.range / std.algorithm. Attempting to pass them
around (by value, because they're generally structs) and iterating them
multiple times tend to produce odd results. The bad thing is that most
test cases use arrays as convenient underlying data sources, which masks
this problem because arrays have stronger semantics than your general
input/forward range, so issues like the above are often overlooked.

T

-- 
"Real programmers can write assembly code in any language. :-)" -- Larry Wall