Calculating mean and standard deviation with std.algorithm.reduce

Wed Feb 13 05:44:04 PST 2013

Hello all,

A little challenge that's been bothering me today.

The docs for std.algorithm give an illustration of its use to calculate mean and 
standard deviation in a single pass:

       // Compute sum and sum of squares in one pass
       r = reduce!("a + b", "a + b * b")(tuple(0.0, 0.0), a);
       assert(r[0] == 35); // sum
       assert(r[1] == 233); // sum of squares
       // Compute average and standard deviation from the above
       auto avg = r[0] / a.length;
       auto stdev = sqrt(r[1] / a.length - avg * avg);

However, this formula for standard deviation is one that is well known for being 
subject to potentially fatal rounding error.  Consider the following:

       float[] a = [10_000.0f, 10_001.0f, 10_002.0f];

       auto r = reduce!("a + b", "a + b * b")(tuple(0.0f, 0.0f), a);

       auto avg = r[0] / a.length;
       auto sd = sqrt(r[1] / a.length - avg * avg);
       writeln(avg, "\t", sd);

... which gives a value of 0 for standard deviation when compiled with LDC or 
GDC, and -nan with DMD, when the correct answer is 0.816497.

An improved online formula calculates together two quantities:

              {  x[1]    if k == 1
     M[k]  =  {
              {  M[k-1] + (x[k] - M[k-1]) / k    , if k > 1

              {  0      if k == 1
     Q[k]  =  {
              {  Q[k-1] + (k - 1) * ((x[k] - M[k-1]) ^^ 2) / k   ,  if k > 1

... which one can readily prove give you respectively the mean of x[1], ..., 
x[k] and the sum of squares of deviations from the mean; so that you can 
calculate sample variance as Q[n] / (n - 1), or standard variance as Q[n] / n 
with standard deviation being given by the square root of this.

In a reduce sense, you'd want to calculate the mean according to,

     reduce!"a + (b - a)/k"(0.0, x);

... where k represents the index count 1, 2, 3, ...  However, it's not evident 
to me how you could get reduce() to know this counting value.

Where calculating Q[k] is concerned, you need to know both the index value _and_ 
the value of the corresponding M[k].  Again, it's not evident to me how you'd 
indicate to reduce() what is needed.

Can anyone offer advice on how to achieve these calculations using reduce?

Thanks & best wishes,

     -- Joe