RandomSample with specified random number generator

Tue Jun 12 05:46:33 PDT 2012

Joseph Rushton Wakeling wrote:
> Hello all,
> 
> An interesting challenge with the randomSample functionality in random.d.
> 
> randomSample takes as input a range, a number of points to sample
> from that range, and (optionally) a random number generator.  It
> returns a range corresponding to a lazily-calculated sample of the
> desired size.
> 
> If you don't pass it a specific random number generator, then it
> just uses the thread-global random number generator.  A consequence
> of this is that the sample changes each time you use it, i.e. if you
> write,
> 
>       auto sample1 = randomSample(iota(0, 100), 5);
>       writeln(sample1);
>       writeln(sample1);
>       writeln(sample1);
> 
> ... you get 3 different random samples, e.g.
> 
>       [0, 17, 18, 43, 77]
>       [20, 25, 34, 53, 97]
>       [6, 12, 28, 42, 87]
> 
> Conversely, if you _do_ pass randomSample a specific random number
> generator, a copy of it is stored inside the RandomSample struct.
> This is necessary because, as the sample is lazily evaluated, you
> need to guarantee the RNG's existence at the point when you
> evaluate.
> 
> A consequence of this is that if you evaluate the sample multiple
> times, you get the same result, i.e. if you do:
> 
>       auto sample2 = randomSample(iota(0, 100), 5, rndGen);
>       writeln("Sample with specified RNG:");
>       writeln(sample2);
>       writeln(sample2);
>       writeln(sample2);
> 
> then you get something like e.g.
> 
>       [33, 35, 54, 55, 94]
>       [33, 35, 54, 55, 94]
>       [33, 35, 54, 55, 94]
> 
> ... which is problematic because it's different behaviour from the
> case without a specific RNG being used, but not problematic per se.
> One can see a logic for either case (a sample range always giving
> the same result, or always giving a different and independent
> result), it's just a matter of being clear which will happen.
> 
> However, a real problem arises if you want to have multiple samples
> each being passed a specific source of randomness.  If you do:
> 
>       auto sample3 = randomSample(iota(0, 100), 5, rndGen);
>       writeln(sample3);
>       auto sample4 = randomSample(iota(0, 100), 5, rndGen);
>       writeln(sample4);
>       auto sample5 = randomSample(iota(0, 100), 5, rndGen);
>       writeln(sample5);
> 
> ... then you'd expect in principle to get 3 different samples.
> However, because rndGen is _copied_ rather than used directly, its
> state does not get updated when the samples are evaluated.  As a
> consequence, each of the samples above gets passed _the same random
> number generator in the same state_, and you get out the same
> samples, e.g.
> 
>       [33, 35, 54, 55, 94]
>       [33, 35, 54, 55, 94]
>       [33, 35, 54, 55, 94]
> 
> (... yes, the same as sample2, assuming we're still in the same
> program and haven't made any other uses of rndGen in the meantime).
> 
> What this means is that randomSample is impossible to use
> effectively with a specified random number generator.  It's not just
> that successive "different" samples produce the same output, it's
> also that if you generate a random sample and then generate other
> random numbers afterwards, they will be correlated.
> 
> I can see only two possible resolutions of this issue, but I'm
> curious if anyone else can think of something different.
> 
>   (i) store a reference to the random number generator instead of a copy
>       (is this even possible?).  The trouble is, this isn't safe, because
>       what if e.g. you have a function which does something like
> 
>           auto generateSample()
>           {
>               auto rng = Random(unpredictableSeed);
>               auto sample = randomSample(iota(0, 100), 5, rng);
>               return sample;
>           }
> 
>           void main()
>           {
>               auto sample = generateSample();
>               writeln(sample);  // Won't work, because rng will no longer exist.
>           }
> 
>       ... so this option seems impermissible.
> 
>  (ii) when copying the random number generator, seed it with an unpredictable
>       seed before generating the first point of the random sample.  This will
>       effectively make the sample an independent thread with respect to random
>       number generation.  The disadvantage is that you lose the reproducibility
>       of the sampling behaviour (e.g. computer game where you want to avoid the
>       player being able to save/reload/try again and get a different outcome)
>       and there might be unexpected correlations that occur if the seeding or
>       the random number generation algorithm are done poorly.
> 
> The second seems the only really valid option to me, and has the
> advantage of making identical the behaviour of the sample whether or
> not it's given a specific RNG (i.e. using the sample 3 different
> times gets you 3 different samples).
> 
> ... any thoughts on this and on possible alternative options?

Probably I'm not seeing the issue

auto sample3 = randomSample(iota(0, 100), 5, Random(unpredictableSeed));
writeln(sample3);
auto sample4 = randomSample(iota(0, 100), 5, Random(unpredictableSeed));
writeln(sample4);
auto sample5 = randomSample(iota(0, 100), 5, Random(unpredictableSeed));
writeln(sample5);

outputs

[21, 38, 57, 86, 90]
[21, 39, 68, 79, 94]
[21, 22, 57, 86, 92]

Besides the fact that the ranges always contain 21 (a bug?) this looks
fine to me, doesn't it?

Jens