[Bench!][Mir] +54%..+185% performance boost for Mersenne Twister.

Wed Dec 14 00:23:06 PST 2016

On Saturday, 26 November 2016 at 16:31:40 UTC, Ilya Yaroshenko 
wrote:
> 1. Improve RNG generation performance by making code more 
> friendly for CPU pipelining. Tempering (finalization) 
> operations was mixed with internal payload update operations.

A note on this.  The `opCall` (or, in the range version, 
`popFront`) of Ilya's implementation mixes together two 
superficially independent actions:

   (1) calculating the current random variate from the entry at
       the current index of the internal state array;

   (2) updating the entry at the current index of the internal
       state array, and moving to the next entry.

It's straightforward to split out these two procedures into two 
separate methods (or at least two clearly separated sequences 
within the `opCall`), but doing so results in a notable 
performance hit (on my machine, something on the order of 1 GB/s 
fewer random bits).

Intertwining these steps in this way is therefore a very smart 
optimization (although TBH it feels a little worrying that it's 
necessary).
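
For anyone who wants to see the two steps concretely, here is a 
minimal sketch of the *split* form for 32-bit MT19937. It is not 
Ilya's code: the struct name `SplitMt19937` and the helper 
`updateCurrent` are made up for illustration, and the sketch 
updates the entry before tempering it so that the output matches 
the reference sequence. It makes no attempt to reproduce the 
benchmark numbers; the interleaved version mixes the instructions 
of both steps rather than running them one after the other.

```d
import std.stdio : writeln;

/// Hypothetical, simplified MT19937 (32-bit); NOT Mir's implementation.
struct SplitMt19937
{
    enum size_t n = 624, m = 397;
    enum uint upperMask = 0x8000_0000;
    enum uint lowerMask = 0x7fff_ffff;
    enum uint matrixA = 0x9908_b0df;

    uint[n] state;
    size_t index;

    this(uint seed)
    {
        // Standard MT19937 seeding recurrence (wraps modulo 2^32).
        state[0] = seed;
        foreach (i; 1 .. n)
            state[i] = 1812433253u * (state[i - 1] ^ (state[i - 1] >> 30))
                + cast(uint) i;
        index = 0;
    }

    /// Payload update: refresh the entry at `index` via the MT recurrence.
    void updateCurrent()
    {
        immutable next = (index + 1) % n;
        immutable shifted = (index + m) % n;
        immutable uint y = (state[index] & upperMask) | (state[next] & lowerMask);
        immutable uint mag = ((y & 1) != 0) ? matrixA : 0u;
        state[index] = state[shifted] ^ (y >> 1) ^ mag;
    }

    /// Tempering (finalization): turn a state word into an output variate.
    static uint temper(uint y)
    {
        y ^= y >> 11;
        y ^= (y << 7) & 0x9d2c_5680;
        y ^= (y << 15) & 0xefc6_0000;
        y ^= y >> 18;
        return y;
    }

    /// The two steps run strictly one after the other (the "split" form).
    uint opCall()
    {
        updateCurrent();                          // payload update
        immutable result = temper(state[index]);  // tempering
        index = (index + 1) % n;                  // move to the next entry
        return result;
    }
}

void main()
{
    auto gen = SplitMt19937(5489);
    writeln(gen());
}
```

With the reference seed 5489, MT19937 yields 3499211612 as its 
first output, which gives a quick sanity check for the sketch.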