OOP, faster data layouts, compilers

bearophile bearophileHUGS at lycos.com
Tue May 3 13:51:37 PDT 2011


Sean Cavanaugh:

> In many ways the biggest thing I use regularly in game development that
> I would lose by moving to D would be good built-in SIMD support.

Don has given a nice answer about how D2 plans to face this.

To focus more what Don was saying I think a small exaple will help. This is a C implementation of one Computer Shootout benchmarks, that generates a binary PPM image of the Mandelbrot set:

http://shootout.alioth.debian.org/u32/program.php?test=mandelbrot&lang=gcc&id=4

This is an important part of that C version:


typedef double v2df __attribute__ ((vector_size(16))); /* vector of two doubles */
const v2df zero = { 0.0, 0.0 };
const v2df four = { 4.0, 4.0 };

// Constant throughout the program, value depends on N
int bytes_per_row;
double inverse_w;
double inverse_h;

// Program argument: height and width of the image
int N;

// Lookup table for initial real-axis value
v2df *Crvs;

// Mandelbrot bitmap
uint8_t *bitmap;

static void calc_row(int y) {
  uint8_t *row_bitmap = bitmap + (bytes_per_row * y);
  int x;
  const v2df Civ_init = { y*inverse_h-1.0, y*inverse_h-1.0 };

  for (x = 0; x < N; x += 2) {
    v2df Crv = Crvs[x >> 1];
    v2df Civ = Civ_init;
    v2df Zrv = zero;
    v2df Ziv = zero;
    v2df Trv = zero;
    v2df Tiv = zero;
    int i = 50;
    int two_pixels;
    v2df is_still_bounded;

    do {
      Ziv = (Zrv * Ziv) + (Zrv * Ziv) + Civ;
      Zrv = Trv - Tiv + Crv;
      Trv = Zrv * Zrv;
      Tiv = Ziv * Ziv;

      // All bits will be set to 1 if 'Trv + Tiv' is less than 4
      // and all bits will be set to 0 otherwise. Two elements
      // are calculated in parallel here.
      is_still_bounded = __builtin_ia32_cmplepd(Trv + Tiv, four);

      // Move the sign-bit of the low element to bit 0, move the
      // sign-bit of the high element to bit 1. The result is
      // that the pixel will be set if the calculation was
      // bounded.
      two_pixels = __builtin_ia32_movmskpd(is_still_bounded);
    } while (--i > 0 && two_pixels);

    // The pixel bits must be in the most and second most
    // significant position
    two_pixels <<= 6;

    // Add the two pixels to the bitmap, all bits are
    // initially zero since the area was allocated with calloc()
    row_bitmap[x >> 3] |= (uint8_t) (two_pixels >> (x & 7));
  }
}


GCC 4.6 compiles the inner do-while loop of calc_row() to just this very clean assembly, that in my opinion is quite _beautiful_, it shows one of the most important final purposes of a good compiler:

L9:
    subl    $1, %ecx
    addpd   %xmm0, %xmm0
    mulpd   %xmm0, %xmm1
    movapd  %xmm4, %xmm0
    addpd   %xmm6, %xmm1
    addpd   %xmm5, %xmm0
    subpd   %xmm3, %xmm0
    movapd  %xmm1, %xmm3
    movapd  %xmm0, %xmm4
    mulpd   %xmm1, %xmm3
    mulpd   %xmm0, %xmm4
    movapd  %xmm3, %xmm2
    addpd   %xmm4, %xmm2
    cmplepd %xmm7, %xmm2
    movmskpd    %xmm2, %ebx
    je  L18
    testl   %ebx, %ebx
    jne L9


Those addpd, subpd, mulpd, movapd, etc, instructions work on pairs of doubles (those v2df). And the code uses the cmplepd and movmskpd instructions too, in a very clean way, that I think not even GCC 4.6 is normally able to use by itself. A good language + compiler have many purposes, but producing ASM code like that is one of the most important purposes, expecially if you write numerical code.

A numerical programmer really wants to write code that somehow produces equally clean and powerful code (or better, using AVX 256-bit registers and 3-way instructions) in numerical processing kernels (often such kernels are small, often just bodies of inner loops).

D2 allows to write code almost as clean as this C one (but I think currently no D compiler is able to turn this into clean inlined addpd, subpd, mulpd, movapd instructions. This is a compiler issue, not a language one):

v2df Zrv = zero;
...
Ziv = (Zrv * Ziv) + (Zrv * Ziv) + Civ;
Zrv = Trv - Tiv + Crv;
Trv = Zrv * Zrv;
Tiv = Ziv * Ziv;


In D it becomes:

double[2] Zrv = zero;
...
Ziv[] = (Zrv[] * Ziv[]) + (Zrv[] * Ziv[]) + Civ[];
Zrv[] = Trv[] - Tiv[] + Crv[];
Trv[] = Zrv[] * Zrv[];
Tiv[] = Ziv[] * Ziv[];


But then how do you write this in a clean way in D2/D3?

do {
    ...
    is_still_bounded = __builtin_ia32_cmplepd(Trv + Tiv, four);
    two_pixels = __builtin_ia32_movmskpd(is_still_bounded);
} while (--i > 0 && two_pixels);



Using those __builtin_ia32_cmplepd() and __builtin_ia32_movmskpd() is not easy, so there is a tradeoff between allowing easy to write code, and giving power. So it's acceptable for a language to give a bit less power if the code is simpler to write. Yet, in a system language if you don't give people a way to produce ASM code as clean as the one I've shown in the inner loops of numerical processing code, some D2 programmers will be forced to write down inline asm, and that's sometimes worse than using intrinsics like __builtin_ia32_cmplepd().

Writing efficient inner loops is very important for numerical processing code, and I think numerical processing code is important for D2.

Time ago I have suggested to extend the D2 vector operations to code like this, but I think this is not enough still:

float[4] a, b, c, d;
c = a[] == b[];
d = a[] >= b[];

Bye,
bearophile


More information about the Digitalmars-d mailing list