Vectorization and more

bearophile bearophileHUGS at lycos.com
Tue Apr 13 01:54:47 PDT 2010


Some musings about vectorization.
The benchmarks of the Shootout site are sometimes useful, because they are well chosen to cover a wide variety of computations and common kinds of kernels.

One of those benchmarks is n-body; this is one of the faster implementations, in C:
http://shootout.alioth.debian.org/u32/program.php?test=nbody&lang=gcc&id=5

Profiling of the n-body benchmark shows that most of the time is spent in the two nested loops of the advance() function (inside which there is a critical sqrt).

This is a summary of the code of the C implementation. The increased performance compared to the other versions comes from packing x and y into one SSE register, and z plus an empty value into another; this allows three operations (on x, y and z, or on the velocities) to be done in two instructions instead of three:


typedef double v2df __attribute__ (( vector_size(2*sizeof(double)) ));

struct planet {
    v2df xy;
    v2df z0; /* z and zero */
    v2df vxvy;
    v2df vz00;	/* vz and zero */
    v2df massmass; /* the mass in both components */
};

static void advance(int q) {
    ...
        for (j = i + 1; j < NBODIES; j++) {
          b2 = &(bodies[j]);
          dxdy = b->xy - b2->xy;
          dz00 = b->z0 - b2->z0;
          /* dx*dx+dy*dy | dz*dz */
          tsquare = __builtin_ia32_haddpd(dxdy*dxdy,dz00*dz00);
          /* dx*dx+dy*dy+dz*dz | dx*dx+dy*dy+dz*dz */
          distance2 = __builtin_ia32_haddpd(tsquare,tsquare);
          magmag = dtdt / (__builtin_ia32_sqrtpd(distance2)*distance2); /* dt / distance^3 in both slots */
          dxdy *= magmag;
          dz00 *= magmag;
          b->vxvy -= dxdy * b2->massmass;
          b->vz00 -= dz00 * b2->massmass;
          b2->vxvy += dxdy * b->massmass;
          b2->vz00 += dz00 * b->massmass;
        }
      }
  ...      
}


Note how the C code performs two sqrt in parallel, even if only one sqrt is necessary, because this helps the compiler keep the values in the SSE registers, avoiding a waste of time (but the CPU will perform two sqrt, so it will waste some electricity. This is not the kind of optimization you want for a netbook that has to save as much power as possible. Contrary to what Chris Lattner thinks, I believe a compilation switch like -Oe can exist, to compile a program with the purpose of saving energy at runtime).
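
For comparison, this is a minimal sketch (mine, not taken from the benchmark sources) of what the single-sqrt alternative looks like when wrapped in a small helper; mag_scalar_sqrt is a made-up name, and the point is the extra traffic between the vector and scalar worlds that the parallel sqrtpd avoids:


#include <math.h>

typedef double v2df __attribute__ (( vector_size(2*sizeof(double)) ));

/* Extract one lane, take a single scalar sqrt, then broadcast the
   result back into a v2df so the rest of the vector math can go on. */
static v2df mag_scalar_sqrt(v2df dtdt, v2df distance2) {
    union { v2df v; double d[2]; } u = { distance2 };
    double dist = sqrt(u.d[0]);        /* one sqrt instead of two */
    v2df distdist = { dist, dist };    /* rebroadcast for the vector ops */
    return dtdt / (distdist * distance2);
}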

Note 2: in practice I have seen that LDC is able to produce from D code a binary about as fast as this C one, with no explicit use of SSE registers (and no array ops) :-)


Later I created a D version of the n-body that uses array ops (as expected it's a little slower than the naive version, because array ops are not yet refined in D). This is a summary of the code:


alias double[3] Tri;

struct Body {
    Tri xyz = [0,0,0];
    Tri v = [0,0,0];
    double mass = 0.0;
    ...
}

void advance(double dt) {
    foreach (idx, ref i; bodies) {
        double im = i.mass;
        foreach (ref j; bodies[idx + 1 .. length]) {
            double jm = j.mass;
            Tri d = void;
            d[] = i.xyz[] - j.xyz[];

            double distance = sqrt(d[0]*d[0] + d[1]*d[1] + d[2]*d[2]);
            double mag = dt / (distance * distance * distance);

            i.v[] -= d[] * jm * mag;
            j.v[] += d[] * im * mag;
        }
    }

    foreach (ref i; bodies)
        i.xyz[] += dt * i.v[];
}



Future D implementations will surely be able to use an SSE register to store a double[2] (and a double[4] in 256-bit CPU registers, etc.).
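
As a rough illustration of the wider case (my own sketch; v4df is a name I made up by analogy with the v2df above, and whether the compiler emits one 256-bit instruction or splits it depends on the target CPU):


typedef double v4df __attribute__ (( vector_size(4*sizeof(double)) ));

/* Four doubles subtracted at once: on a CPU with 256-bit vector
   registers this can become a single instruction, otherwise the
   compiler splits it into two 128-bit operations. */
static v4df sub4(v4df a, v4df b) {
    return a - b;
}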

But to me the problem here seems a little different: given that Tri array, present twice inside the Body struct, will D compilers be able to use two SSE registers (with a padding item) to implement an operation like this, done on arrays of length 3?

d[] = i.xyz[] - j.xyz[];

I think the LLVM devs will eventually add vector ops to LLVM, so LLVM will have to face both:
1) the need to pad registers when the arrays aren't as long as an integral number of SSE registers;
2) the need to copy scalar values into all slots of an SSE register to keep its operations as fast as possible when mixed with vector ops; a rough sketch of such a lowering follows below. (Later I might ask Lattner about native vector ops in LLVM. This can be really useful for LDC.)
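
This is a minimal C sketch (mine, using the same GCC vector extension as the code above; tri3 and the helper names are made up) of how a length-3 array op like d[] = i.xyz[] - j.xyz[] and the scalar products could be lowered to a padded pair of SSE registers, covering both points:


typedef double v2df __attribute__ (( vector_size(2*sizeof(double)) ));

/* Hypothetical padded layout for a double[3]: x and y in one SSE
   register, z plus a zero pad in another (the struct planet trick). */
typedef struct {
    v2df xy;   /* components 0 and 1 */
    v2df z0;   /* component 2 and a zero pad */
} tri3;

/* d[] = a[] - b[] on a length-3 array becomes two SSE subtractions. */
static tri3 tri3_sub(tri3 a, tri3 b) {
    tri3 d;
    d.xy = a.xy - b.xy;
    d.z0 = a.z0 - b.z0;
    return d;
}

/* v[] -= d[] * s: the scalar s is copied into both slots of a v2df
   (point 2), then two multiply-and-subtract steps do the work. */
static void tri3_sub_scaled(tri3 *v, tri3 d, double s) {
    v2df ss = { s, s };
    v->xy -= d.xy * ss;
    v->z0 -= d.z0 * ss;
}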

Bye,
bearophile


