OOP, faster data layouts, compilers

Wed Sep 2 12:04:09 PDT 2015

On Tuesday, 3 May 2011 at 20:51:37 UTC, bearophile wrote:
> Sean Cavanaugh:
>
>> In many ways the biggest thing I use regularly in game 
>> development that I would lose by moving to D would be good 
>> built-in SIMD support.
>
> Don has given a nice answer about how D2 plans to face this.
>
> To focus more what Don was saying I think a small exaple will 
> help. This is a C implementation of one Computer Shootout 
> benchmarks, that generates a binary PPM image of the Mandelbrot 
> set:
>
> http://shootout.alioth.debian.org/u32/program.php?test=mandelbrot&lang=gcc&id=4
>
> This is an important part of that C version:
>
>
> typedef double v2df __attribute__ ((vector_size(16))); /* 
> vector of two doubles */
> const v2df zero = { 0.0, 0.0 };
> const v2df four = { 4.0, 4.0 };
>
> // Constant throughout the program, value depends on N
> int bytes_per_row;
> double inverse_w;
> double inverse_h;
>
> // Program argument: height and width of the image
> int N;
>
> // Lookup table for initial real-axis value
> v2df *Crvs;
>
> // Mandelbrot bitmap
> uint8_t *bitmap;
>
> static void calc_row(int y) {
>   uint8_t *row_bitmap = bitmap + (bytes_per_row * y);
>   int x;
>   const v2df Civ_init = { y*inverse_h-1.0, y*inverse_h-1.0 };
>
>   for (x = 0; x < N; x += 2) {
>     v2df Crv = Crvs[x >> 1];
>     v2df Civ = Civ_init;
>     v2df Zrv = zero;
>     v2df Ziv = zero;
>     v2df Trv = zero;
>     v2df Tiv = zero;
>     int i = 50;
>     int two_pixels;
>     v2df is_still_bounded;
>
>     do {
>       Ziv = (Zrv * Ziv) + (Zrv * Ziv) + Civ;
>       Zrv = Trv - Tiv + Crv;
>       Trv = Zrv * Zrv;
>       Tiv = Ziv * Ziv;
>
>       // All bits will be set to 1 if 'Trv + Tiv' is less than 4
>       // and all bits will be set to 0 otherwise. Two elements
>       // are calculated in parallel here.
>       is_still_bounded = __builtin_ia32_cmplepd(Trv + Tiv, 
> four);
>
>       // Move the sign-bit of the low element to bit 0, move the
>       // sign-bit of the high element to bit 1. The result is
>       // that the pixel will be set if the calculation was
>       // bounded.
>       two_pixels = __builtin_ia32_movmskpd(is_still_bounded);
>     } while (--i > 0 && two_pixels);
>
>     // The pixel bits must be in the most and second most
>     // significant position
>     two_pixels <<= 6;
>
>     // Add the two pixels to the bitmap, all bits are
>     // initially zero since the area was allocated with calloc()
>     row_bitmap[x >> 3] |= (uint8_t) (two_pixels >> (x & 7));
>   }
> }
>
>
> GCC 4.6 compiles the inner do-while loop of calc_row() to just 
> this very clean assembly, that in my opinion is quite 
> _beautiful_, it shows one of the most important final purposes 
> of a good compiler:
>
> L9:
>     subl    $1, %ecx
>     addpd   %xmm0, %xmm0
>     mulpd   %xmm0, %xmm1
>     movapd  %xmm4, %xmm0
>     addpd   %xmm6, %xmm1
>     addpd   %xmm5, %xmm0
>     subpd   %xmm3, %xmm0
>     movapd  %xmm1, %xmm3
>     movapd  %xmm0, %xmm4
>     mulpd   %xmm1, %xmm3
>     mulpd   %xmm0, %xmm4
>     movapd  %xmm3, %xmm2
>     addpd   %xmm4, %xmm2
>     cmplepd %xmm7, %xmm2
>     movmskpd    %xmm2, %ebx
>     je  L18
>     testl   %ebx, %ebx
>     jne L9
>
>
> Those addpd, subpd, mulpd, movapd, etc, instructions work on 
> pairs of doubles (those v2df). And the code uses the cmplepd 
> and movmskpd instructions too, in a very clean way, that I 
> think not even GCC 4.6 is normally able to use by itself. A 
> good language + compiler have many purposes, but producing ASM 
> code like that is one of the most important purposes, 
> expecially if you write numerical code.
>
> A numerical programmer really wants to write code that somehow 
> produces equally clean and powerful code (or better, using AVX 
> 256-bit registers and 3-way instructions) in numerical 
> processing kernels (often such kernels are small, often just 
> bodies of inner loops).
>
> D2 allows to write code almost as clean as this C one (but I 
> think currently no D compiler is able to turn this into clean 
> inlined addpd, subpd, mulpd, movapd instructions. This is a 
> compiler issue, not a language one):
>
> v2df Zrv = zero;
> ...
> Ziv = (Zrv * Ziv) + (Zrv * Ziv) + Civ;
> Zrv = Trv - Tiv + Crv;
> Trv = Zrv * Zrv;
> Tiv = Ziv * Ziv;
>
>
> In D it becomes:
>
> double[2] Zrv = zero;
> ...
> Ziv[] = (Zrv[] * Ziv[]) + (Zrv[] * Ziv[]) + Civ[];
> Zrv[] = Trv[] - Tiv[] + Crv[];
> Trv[] = Zrv[] * Zrv[];
> Tiv[] = Ziv[] * Ziv[];
>
>
> But then how do you write this in a clean way in D2/D3?
>
> do {
>     ...
>     is_still_bounded = __builtin_ia32_cmplepd(Trv + Tiv, four);
>     two_pixels = __builtin_ia32_movmskpd(is_still_bounded);
> } while (--i > 0 && two_pixels);
>
>
>
> Using those __builtin_ia32_cmplepd() and 
> __builtin_ia32_movmskpd() is not easy, so there is a tradeoff 
> between allowing easy to write code, and giving power. So it's 
> acceptable for a language to give a bit less power if the code 
> is simpler to write. Yet, in a system language if you don't 
> give people a way to produce ASM code as clean as the one I've 
> shown in the inner loops of numerical processing code, some D2 
> programmers will be forced to write down inline asm, and that's 
> sometimes worse than using intrinsics like 
> __builtin_ia32_cmplepd().
>
> Writing efficient inner loops is very important for numerical 
> processing code, and I think numerical processing code is 
> important for D2.
>
> Time ago I have suggested to extend the D2 vector operations to 
> code like this, but I think this is not enough still:
>
> float[4] a, b, c, d;
> c = a[] == b[];
> d = a[] >= b[];
>
> Bye,
> bearophile

Just found this old post, since I'm tuning mandelbrot.d right now 
[0].

The good news: LDC produces code, which is quite close to the C 
version.

mulsd  %xmm6,%xmm4
subsd  %xmm1,%xmm7
addsd  %xmm4,%xmm4
addsd  %xmm5,%xmm7
addsd  %xmm0,%xmm4
movaps %xmm7,%xmm6
mulsd  %xmm6,%xmm6
movaps %xmm4,%xmm2
mulsd  %xmm2,%xmm2
movaps %xmm2,%xmm1
addsd  %xmm6,%xmm1
ucomisd %xmm1,%xmm3
jb     4026f0 <_D10mandelbrot11computeLineFNaNbNfmiZAa+0x130>
jl     402680 <_D10mandelbrot11computeLineFNaNbNfmiZAa+0xc0>

Even better, the code is produce from the following (inlined!) 
source,
which is pretty much the mathematical definition.

for(auto i = 0; i < iter && norm(Z) <= lim; i++)
         Z = Z*Z + C;

The bad news: cmplepd and movmskpd are not used. Is that possible 
somehow four years later?

The gcc code is roughly twice as fast at the moment, but I don't 
know if cmplepd and movmskpd is the only thing missing.

[0] https://github.com/qznc/d-shootout