Update on random sampling in Phobos

Joseph Rushton Wakeling joseph.wakeling at webdrake.net
Sat Apr 21 14:12:50 PDT 2012


Hello all,

An update on the proposed replacement for Phobos' RandomSample struct.

A demo copy of the code is on GitHub at:
https://github.com/WebDrake/RandomSample

This is a little standalone D file that can be compiled and run and which goes 
through several simple benchmark comparisons of the current Phobos RandomSample 
compared to this new version based on J.S. Vitter's superior algorithm.

General conclusions:

    * The speed difference is quite apparent whenever there is a significant
      difference (2+ orders of magnitude) between the sample and data sizes.
      It's apparent, but less marked, for smaller ratios.

    * For many practical cases (e.g. if you want one sample from a not very
      big dataset) the difference would not be noticed.  However, if you are
      doing a lot of sampling, or if you are sampling from a very large dataset,
      the difference can matter.

    * Where data records have to be read from an external source, e.g. a file,
      the time needed to do this overwhelms any difference in sampling speed.
      The new algorithm is still faster, but you don't really notice the
      difference.

If there's a desire for more benchmarks or other demonstrations of the code's 
effectiveness, please suggest what they might be and I'll go ahead and implement 
them[*].

If on the basis of the above you'd like a patch for Phobos proper, let me know 
and I'll go ahead and prepare one.  Note that before I do that, I'd like a 
decision on Jerro's tweak to the randomSample function:
https://github.com/D-Programming-Language/phobos/pull/542

... which is designed to fix bug 7936:
http://d.puremagic.com/issues/show_bug.cgi?id=7936

Thanks and best wishes,

     -- Joe

-------
[*] Just checking -- no more emails on this topic have got lost in the ether, 
have they?


More information about the Digitalmars-d mailing list