RandomSample with specified random number generator
Joseph Rushton Wakeling
joseph.wakeling at webdrake.net
Tue Jun 12 05:28:27 PDT 2012
Hello all,
An interesting challenge with the randomSample functionality in random.d.
randomSample takes as input a range, a number of points to sample from that
range, and (optionally) a random number generator. It returns a range
corresponding to a lazily-calculated sample of the desired size.
If you don't pass it a specific random number generator, then it just uses the
thread-global random number generator. A consequence of this is that the sample
changes each time you use it, i.e. if you write,
auto sample1 = randomSample(iota(0, 100), 5);
writeln(sample1);
writeln(sample1);
writeln(sample1);
... you get 3 different random samples, e.g.
[0, 17, 18, 43, 77]
[20, 25, 34, 53, 97]
[6, 12, 28, 42, 87]
Conversely, if you _do_ pass randomSample a specific random number generator, a
copy of it is stored inside the RandomSample struct. This is necessary because,
as the sample is lazily evaluated, you need to guarantee the RNG's existence at
the point when you evaluate.
A consequence of this is that if you evaluate the sample multiple times, you get
the same result, i.e. if you do:
auto sample2 = randomSample(iota(0, 100), 5, rndGen);
writeln("Sample with specified RNG:");
writeln(sample2);
writeln(sample2);
writeln(sample2);
then you get something like e.g.
[33, 35, 54, 55, 94]
[33, 35, 54, 55, 94]
[33, 35, 54, 55, 94]
... which is problematic because it's different behaviour from the case without
a specific RNG being used, but not problematic per se. One can see a logic for
either case (a sample range always giving the same result, or always giving a
different and independent result), it's just a matter of being clear which will
happen.
However, a real problem arises if you want to have multiple samples each being
passed a specific source of randomness. If you do:
auto sample3 = randomSample(iota(0, 100), 5, rndGen);
writeln(sample3);
auto sample4 = randomSample(iota(0, 100), 5, rndGen);
writeln(sample4);
auto sample5 = randomSample(iota(0, 100), 5, rndGen);
writeln(sample5);
... then you'd expect in principle to get 3 different samples. However, because
rndGen is _copied_ rather than used directly, its state does not get updated
when the samples are evaluated. As a consequence, each of the samples above
gets passed _the same random number generator in the same state_, and you get
out the same samples, e.g.
[33, 35, 54, 55, 94]
[33, 35, 54, 55, 94]
[33, 35, 54, 55, 94]
(... yes, the same as sample2, assuming we're still in the same program and
haven't made any other uses of rndGen in the meantime).
What this means is that randomSample is impossible to use effectively with a
specified random number generator. It's not just that successive "different"
samples produce the same output, it's also that if you generate a random sample
and then generate other random numbers afterwards, they will be correlated.
I can see only two possible resolutions of this issue, but I'm curious if anyone
else can think of something different.
(i) store a reference to the random number generator instead of a copy
(is this even possible?). The trouble is, this isn't safe, because
what if e.g. you have a function which does something like
auto generateSample()
{
auto rng = Random(unpredictableSeed);
auto sample = randomSample(iota(0, 100), 5, rng);
return sample;
}
void main()
{
auto sample = generateSample();
writeln(sample); // Won't work, because rng will no longer exist.
}
... so this option seems impermissible.
(ii) when copying the random number generator, seed it with an unpredictable
seed before generating the first point of the random sample. This will
effectively make the sample an independent thread with respect to random
number generation. The disadvantage is that you lose the reproducibility
of the sampling behaviour (e.g. computer game where you want to avoid the
player being able to save/reload/try again and get a different outcome)
and there might be unexpected correlations that occur if the seeding or
the random number generation algorithm are done poorly.
The second seems the only really valid option to me, and has the advantage of
making identical the behaviour of the sample whether or not it's given a
specific RNG (i.e. using the sample 3 different times gets you 3 different samples).
... any thoughts on this and on possible alternative options?
Thanks & best wishes,
-- Joe
More information about the Digitalmars-d
mailing list