randomShuffle

Mon Jun 3 10:00:18 PDT 2013

On Monday, 3 June 2013 at 13:18:30 UTC, Joseph Rushton Wakeling 
wrote:
> On 06/03/2013 02:30 PM, Joseph Rushton Wakeling wrote:
>> On 06/03/2013 01:29 PM, Diggory wrote:
>>> For small samples from very large ranges an efficient 
>>> algorithm would be:
>>>
>>> int[] randomGen(int N, int M) {
>>>     if (N == 0)
>>>         return [];
>>>
>>>     int[] result = randomGen(N-1, M-1);
>>>
>>>     int num = rand(M);
>>>     foreach (ref int i; result)
>>>         if (i >= num)
>>>             ++i;
>>>
>>>     result ~= num;
>>>
>>>     return result;
>>> }
>> 
>> Are you sure about the efficiency of that algorithm?  Looks 
>> like it's be O(M) to me.
>
> Apologies, I misread the algorithm slightly.  Your qualifier is 
> quite correct --
> when the number of sample points is very small (e.g. 5 out of 
> some arbitrary
> very large number), that algorithm is very quick indeed (I ran 
> some tests as I
> was curious:-), and can outperform D's randomSample.
>
> It doesn't scale with the size of the input (your value M), but 
> with the number
> of sample points n, but that scaling is very bad -- so as the 
> sample size grows
> it becomes _much_ worse than randomSample.
>
> Do you have a reference for the algorithm?  Off the top of my 
> head I don't
> recognize it from any of the texts I've read.

Thanks for testing before dismissing completely :P The way it 
returns results can be improved a lot by pre-allocating a range 
of the necessary size/using a range passed in.

The complexity is O(N²) where N is the number of samples out of 
M. The algorithm is something I just thought of although I've 
possibly used it before and I'm sure someone else has thought of 
it too. It's quite simple, it effectively says "pick a value from 
the range except for this value" recursively but instead of 
dividing the range in two, it shifts the top part down first to 
make a contiguous range and then shifts the results which are 
past the split up one afterward.

I expect the phobos implementation to be something like O(M) in 
which case my method would be faster whenever N < sqrt(M) (with 
the optimisations)