OOP, faster data layouts, compilers
qznc via Digitalmars-d
digitalmars-d at puremagic.com
Wed Sep 2 12:04:09 PDT 2015
On Tuesday, 3 May 2011 at 20:51:37 UTC, bearophile wrote:
> Sean Cavanaugh:
>
>> In many ways the biggest thing I use regularly in game
>> development that I would lose by moving to D would be good
>> built-in SIMD support.
>
> Don has given a nice answer about how D2 plans to face this.
>
> To focus more what Don was saying I think a small exaple will
> help. This is a C implementation of one Computer Shootout
> benchmarks, that generates a binary PPM image of the Mandelbrot
> set:
>
> http://shootout.alioth.debian.org/u32/program.php?test=mandelbrot&lang=gcc&id=4
>
> This is an important part of that C version:
>
>
> typedef double v2df __attribute__ ((vector_size(16))); /*
> vector of two doubles */
> const v2df zero = { 0.0, 0.0 };
> const v2df four = { 4.0, 4.0 };
>
> // Constant throughout the program, value depends on N
> int bytes_per_row;
> double inverse_w;
> double inverse_h;
>
> // Program argument: height and width of the image
> int N;
>
> // Lookup table for initial real-axis value
> v2df *Crvs;
>
> // Mandelbrot bitmap
> uint8_t *bitmap;
>
> static void calc_row(int y) {
> uint8_t *row_bitmap = bitmap + (bytes_per_row * y);
> int x;
> const v2df Civ_init = { y*inverse_h-1.0, y*inverse_h-1.0 };
>
> for (x = 0; x < N; x += 2) {
> v2df Crv = Crvs[x >> 1];
> v2df Civ = Civ_init;
> v2df Zrv = zero;
> v2df Ziv = zero;
> v2df Trv = zero;
> v2df Tiv = zero;
> int i = 50;
> int two_pixels;
> v2df is_still_bounded;
>
> do {
> Ziv = (Zrv * Ziv) + (Zrv * Ziv) + Civ;
> Zrv = Trv - Tiv + Crv;
> Trv = Zrv * Zrv;
> Tiv = Ziv * Ziv;
>
> // All bits will be set to 1 if 'Trv + Tiv' is less than 4
> // and all bits will be set to 0 otherwise. Two elements
> // are calculated in parallel here.
> is_still_bounded = __builtin_ia32_cmplepd(Trv + Tiv,
> four);
>
> // Move the sign-bit of the low element to bit 0, move the
> // sign-bit of the high element to bit 1. The result is
> // that the pixel will be set if the calculation was
> // bounded.
> two_pixels = __builtin_ia32_movmskpd(is_still_bounded);
> } while (--i > 0 && two_pixels);
>
> // The pixel bits must be in the most and second most
> // significant position
> two_pixels <<= 6;
>
> // Add the two pixels to the bitmap, all bits are
> // initially zero since the area was allocated with calloc()
> row_bitmap[x >> 3] |= (uint8_t) (two_pixels >> (x & 7));
> }
> }
>
>
> GCC 4.6 compiles the inner do-while loop of calc_row() to just
> this very clean assembly, that in my opinion is quite
> _beautiful_, it shows one of the most important final purposes
> of a good compiler:
>
> L9:
> subl $1, %ecx
> addpd %xmm0, %xmm0
> mulpd %xmm0, %xmm1
> movapd %xmm4, %xmm0
> addpd %xmm6, %xmm1
> addpd %xmm5, %xmm0
> subpd %xmm3, %xmm0
> movapd %xmm1, %xmm3
> movapd %xmm0, %xmm4
> mulpd %xmm1, %xmm3
> mulpd %xmm0, %xmm4
> movapd %xmm3, %xmm2
> addpd %xmm4, %xmm2
> cmplepd %xmm7, %xmm2
> movmskpd %xmm2, %ebx
> je L18
> testl %ebx, %ebx
> jne L9
>
>
> Those addpd, subpd, mulpd, movapd, etc, instructions work on
> pairs of doubles (those v2df). And the code uses the cmplepd
> and movmskpd instructions too, in a very clean way, that I
> think not even GCC 4.6 is normally able to use by itself. A
> good language + compiler have many purposes, but producing ASM
> code like that is one of the most important purposes,
> expecially if you write numerical code.
>
> A numerical programmer really wants to write code that somehow
> produces equally clean and powerful code (or better, using AVX
> 256-bit registers and 3-way instructions) in numerical
> processing kernels (often such kernels are small, often just
> bodies of inner loops).
>
> D2 allows to write code almost as clean as this C one (but I
> think currently no D compiler is able to turn this into clean
> inlined addpd, subpd, mulpd, movapd instructions. This is a
> compiler issue, not a language one):
>
> v2df Zrv = zero;
> ...
> Ziv = (Zrv * Ziv) + (Zrv * Ziv) + Civ;
> Zrv = Trv - Tiv + Crv;
> Trv = Zrv * Zrv;
> Tiv = Ziv * Ziv;
>
>
> In D it becomes:
>
> double[2] Zrv = zero;
> ...
> Ziv[] = (Zrv[] * Ziv[]) + (Zrv[] * Ziv[]) + Civ[];
> Zrv[] = Trv[] - Tiv[] + Crv[];
> Trv[] = Zrv[] * Zrv[];
> Tiv[] = Ziv[] * Ziv[];
>
>
> But then how do you write this in a clean way in D2/D3?
>
> do {
> ...
> is_still_bounded = __builtin_ia32_cmplepd(Trv + Tiv, four);
> two_pixels = __builtin_ia32_movmskpd(is_still_bounded);
> } while (--i > 0 && two_pixels);
>
>
>
> Using those __builtin_ia32_cmplepd() and
> __builtin_ia32_movmskpd() is not easy, so there is a tradeoff
> between allowing easy to write code, and giving power. So it's
> acceptable for a language to give a bit less power if the code
> is simpler to write. Yet, in a system language if you don't
> give people a way to produce ASM code as clean as the one I've
> shown in the inner loops of numerical processing code, some D2
> programmers will be forced to write down inline asm, and that's
> sometimes worse than using intrinsics like
> __builtin_ia32_cmplepd().
>
> Writing efficient inner loops is very important for numerical
> processing code, and I think numerical processing code is
> important for D2.
>
> Time ago I have suggested to extend the D2 vector operations to
> code like this, but I think this is not enough still:
>
> float[4] a, b, c, d;
> c = a[] == b[];
> d = a[] >= b[];
>
> Bye,
> bearophile
Just found this old post, since I'm tuning mandelbrot.d right now
[0].
The good news: LDC produces code, which is quite close to the C
version.
mulsd %xmm6,%xmm4
subsd %xmm1,%xmm7
addsd %xmm4,%xmm4
addsd %xmm5,%xmm7
addsd %xmm0,%xmm4
movaps %xmm7,%xmm6
mulsd %xmm6,%xmm6
movaps %xmm4,%xmm2
mulsd %xmm2,%xmm2
movaps %xmm2,%xmm1
addsd %xmm6,%xmm1
ucomisd %xmm1,%xmm3
jb 4026f0 <_D10mandelbrot11computeLineFNaNbNfmiZAa+0x130>
jl 402680 <_D10mandelbrot11computeLineFNaNbNfmiZAa+0xc0>
Even better, the code is produce from the following (inlined!)
source,
which is pretty much the mathematical definition.
for(auto i = 0; i < iter && norm(Z) <= lim; i++)
Z = Z*Z + C;
The bad news: cmplepd and movmskpd are not used. Is that possible
somehow four years later?
The gcc code is roughly twice as fast at the moment, but I don't
know if cmplepd and movmskpd is the only thing missing.
[0] https://github.com/qznc/d-shootout
More information about the Digitalmars-d
mailing list