randomShuffle
Diggory
diggsey at googlemail.com
Mon Jun 3 11:28:59 PDT 2013
On Monday, 3 June 2013 at 17:35:22 UTC, Joseph Rushton Wakeling wrote:
> On 06/03/2013 07:00 PM, Diggory wrote:
>> Thanks for testing before dismissing completely :P The way it returns results can be improved a lot by pre-allocating a range of the necessary size, or by using a range passed in.
>
> Surely. :-)
>
>> The complexity is O(N²), where N is the number of samples out of M. The algorithm is something I just thought of, although I've possibly used it before and I'm sure someone else has thought of it too. It's quite simple: it effectively says "pick a value from the range except for this value" recursively, but instead of dividing the range in two, it shifts the top part down first to make a contiguous range, and then shifts the results which are past the split up afterward.
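
For concreteness, here's a minimal sketch of that idea -- this is not the code from the earlier post, and sampleDistinct/picked are names I've made up here. It draws n distinct indices from 0 .. m (assuming n <= m), using one uniform variate per sample:

import std.random : uniform;

// Sketch only: draw n distinct indices from 0 .. m without
// replacement. One random variate per sample; the shifting makes
// the total work O(N²).
size_t[] sampleDistinct(size_t n, size_t m)
{
    size_t[] picked;              // values chosen so far, kept sorted
    auto result = new size_t[n];  // samples, in the order drawn
    foreach (i; 0 .. n)
    {
        // Pick from the m - i values not yet chosen, as if the chosen
        // ones had been shifted out to leave one contiguous range.
        auto r = uniform(0, m - i);
        // Shift r back up past every chosen value at or below it,
        // recovering its position in the original 0 .. m range.
        size_t pos = 0;
        while (pos < picked.length && picked[pos] <= r)
        {
            ++r;
            ++pos;
        }
        picked = picked[0 .. pos] ~ r ~ picked[pos .. $];  // stay sorted
        result[i] = r;
    }
    return result;
}

Each step costs O(i) for the shifting, so the whole thing is O(N²), but it consumes only N random variates and never has to walk all M values.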
>
> It's an interesting little piece of invention. I'll have to look around and see if there's anything similar documented. I've not seen anything in the literature which has this particular property of O(N) random variates plus O(N²) calculation.
>
> The fact that it's not sequential might have been a bit of an issue historically, which maybe explains why I haven't come across something like it.
>
>> I expect the Phobos implementation to be something like O(M), in which case my method would be faster whenever N < sqrt(M) (with the optimisations).
>
> No, the Phobos implementation is O(N) -- I wrote it :-) See: http://braingam.es/2012/07/sampling-d/
>
> The API is Andrei's -- the original implementation used what Knuth refers to as Algorithm S (which is really simple and easy to implement and check, but is indeed O(M)); I updated it to use Algorithm D (which is complicated :-).
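
For reference, a minimal usage sketch of the Phobos API (the range and sample size here are just made-up numbers):

import std.random : randomSample;
import std.range : iota;
import std.stdio : writeln;

void main()
{
    // Draw 50 of the values 0 .. 1_000_000 without replacement.
    // randomSample consumes O(N) random variates and skips over the
    // rest of the input, yielding the sample in increasing order.
    auto sample = randomSample(iota(1_000_000), 50);
    writeln(sample);
}

Note that the output comes out sequential (increasing), which is exactly the property my algorithm doesn't have.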
>
> So, since randomSample takes kN time where k is constant, your algorithm will probably win where N < k.
>
> I found that the sampling time was about equal with N = 50 and your algorithm as it's currently written.
I'd guess that the heavy use of floating-point arithmetic to calculate the step sizes means Algorithm D has a fairly large constant overhead, even though its complexity is lower.
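
If anyone wants to measure it, a quick harness along these lines should show the crossover -- a sketch only: it assumes the sampleDistinct sketch from earlier in the thread is in scope, the sizes and iteration count are made up, and benchmark lives in std.datetime.stopwatch in current Phobos:

import std.array : array;
import std.datetime.stopwatch : benchmark;
import std.random : randomSample;
import std.range : iota;
import std.stdio : writefln;

// Assumes sampleDistinct (the O(N²) sketch above) is in scope.
void main()
{
    enum m = 1_000_000;
    enum n = 50;  // roughly where the timings were reported as equal
    auto times = benchmark!(
        () => randomSample(iota(m), n).array,  // Phobos, O(N)
        () => sampleDistinct(n, m)             // the O(N²) sketch
    )(1_000);
    writefln("randomSample:   %s", times[0]);
    writefln("sampleDistinct: %s", times[1]);
}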