RandomSample with specified random number generator
Jens Mueller
jens.k.mueller at gmx.de
Tue Jun 12 05:46:33 PDT 2012
Joseph Rushton Wakeling wrote:
> Hello all,
>
> An interesting challenge with the randomSample functionality in random.d.
>
> randomSample takes as input a range, a number of points to sample
> from that range, and (optionally) a random number generator. It
> returns a range corresponding to a lazily-calculated sample of the
> desired size.
>
> If you don't pass it a specific random number generator, then it
> just uses the thread-global random number generator. A consequence
> of this is that the sample changes each time you use it, i.e. if you
> write,
>
> auto sample1 = randomSample(iota(0, 100), 5);
> writeln(sample1);
> writeln(sample1);
> writeln(sample1);
>
> ... you get 3 different random samples, e.g.
>
> [0, 17, 18, 43, 77]
> [20, 25, 34, 53, 97]
> [6, 12, 28, 42, 87]
>
> Conversely, if you _do_ pass randomSample a specific random number
> generator, a copy of it is stored inside the RandomSample struct.
> This is necessary because, as the sample is lazily evaluated, you
> need to guarantee the RNG's existence at the point when you
> evaluate.
>
> A consequence of this is that if you evaluate the sample multiple
> times, you get the same result, i.e. if you do:
>
> auto sample2 = randomSample(iota(0, 100), 5, rndGen);
> writeln("Sample with specified RNG:");
> writeln(sample2);
> writeln(sample2);
> writeln(sample2);
>
> then you get something like e.g.
>
> [33, 35, 54, 55, 94]
> [33, 35, 54, 55, 94]
> [33, 35, 54, 55, 94]
>
> ... which is problematic because it's different behaviour from the
> case without a specific RNG being used, but not problematic per se.
> One can see a logic for either case (a sample range always giving
> the same result, or always giving a different and independent
> result), it's just a matter of being clear which will happen.
>
> However, a real problem arises if you want to have multiple samples
> each being passed a specific source of randomness. If you do:
>
> auto sample3 = randomSample(iota(0, 100), 5, rndGen);
> writeln(sample3);
> auto sample4 = randomSample(iota(0, 100), 5, rndGen);
> writeln(sample4);
> auto sample5 = randomSample(iota(0, 100), 5, rndGen);
> writeln(sample5);
>
> ... then you'd expect in principle to get 3 different samples.
> However, because rndGen is _copied_ rather than used directly, its
> state does not get updated when the samples are evaluated. As a
> consequence, each of the samples above gets passed _the same random
> number generator in the same state_, and you get out the same
> samples, e.g.
>
> [33, 35, 54, 55, 94]
> [33, 35, 54, 55, 94]
> [33, 35, 54, 55, 94]
>
> (... yes, the same as sample2, assuming we're still in the same
> program and haven't made any other uses of rndGen in the meantime).
>
> What this means is that randomSample is impossible to use
> effectively with a specified random number generator. It's not just
> that successive "different" samples produce the same output, it's
> also that if you generate a random sample and then generate other
> random numbers afterwards, they will be correlated.
>
> I can see only two possible resolutions of this issue, but I'm
> curious if anyone else can think of something different.
>
> (i) store a reference to the random number generator instead of a copy
> (is this even possible?). The trouble is, this isn't safe, because
> what if e.g. you have a function which does something like
>
> auto generateSample()
> {
> auto rng = Random(unpredictableSeed);
> auto sample = randomSample(iota(0, 100), 5, rng);
> return sample;
> }
>
> void main()
> {
> auto sample = generateSample();
> writeln(sample); // Won't work, because rng will no longer exist.
> }
>
> ... so this option seems impermissible.
>
> (ii) when copying the random number generator, seed it with an unpredictable
> seed before generating the first point of the random sample. This will
> effectively make the sample an independent thread with respect to random
> number generation. The disadvantage is that you lose the reproducibility
> of the sampling behaviour (e.g. computer game where you want to avoid the
> player being able to save/reload/try again and get a different outcome)
> and there might be unexpected correlations that occur if the seeding or
> the random number generation algorithm are done poorly.
>
> The second seems the only really valid option to me, and has the
> advantage of making identical the behaviour of the sample whether or
> not it's given a specific RNG (i.e. using the sample 3 different
> times gets you 3 different samples).
>
> ... any thoughts on this and on possible alternative options?
Probably I'm not seeing the issue
auto sample3 = randomSample(iota(0, 100), 5, Random(unpredictableSeed));
writeln(sample3);
auto sample4 = randomSample(iota(0, 100), 5, Random(unpredictableSeed));
writeln(sample4);
auto sample5 = randomSample(iota(0, 100), 5, Random(unpredictableSeed));
writeln(sample5);
outputs
[21, 38, 57, 86, 90]
[21, 39, 68, 79, 94]
[21, 22, 57, 86, 92]
Besides the fact that the ranges always contain 21 (a bug?) this looks
fine to me, doesn't it?
Jens
More information about the Digitalmars-d
mailing list