Does dmd have SSE intrinsics?

Mon Sep 21 09:22:51 PDT 2009

dsimcha:

> What's wrong with the current implementation of array ops (other than a few misc.
> bugs that have already been filed)?  I thought they already use SSE if available.

The idea is to improve array operations so that they become a handy way to use present and future vector instructions efficiently (AVX too: http://en.wikipedia.org/wiki/Advanced_Vector_Extensions).

So for example if in my D code I have:
float[4] a = [1.0f, 2.0f, 3.0f, 4.0f];
float[4] b = 10.0f;
float[4] c = a[] + b[];

The compiler should implement the third line of D code (the sum of 4 floats) with a single inlined SSE instruction, and use two instructions to load the float value 10 and broadcast it to a whole XMM register.
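
A minimal sketch of the two operations above as plain D functions, assuming the operands are written as slices (a[] + b[]). The names sum4 and splat4 are made up for illustration, and whether a given compiler really emits one inlined SSE addition for sum4 depends on the compiler and its flags:

```d
// Hypothetical helper: element-wise sum of two 4-float static arrays.
float[4] sum4(float[4] a, float[4] b)
{
    float[4] c;
    c[] = a[] + b[]; // array operation: one addition per element
    return c;
}

// Hypothetical helper: fill all 4 elements with one scalar (the "broadcast").
float[4] splat4(float x)
{
    float[4] r = x; // a scalar initializer sets every element
    return r;
}
```

Calling sum4(a, splat4(10.0f)) with the array a from the example gives the same result as the third line.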

If the D code is:
float[8] a = [1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f];
float[8] b = [10.0f, 20.0f, 30.0f, 40.0f, 50.0f, 60.0f, 70.0f, 80.0f];
float[8] c = a[] + b[];
The current vector instructions aren't wide enough to do that in a single instruction (future AVX will be), so the compiler has to inline two SSE instructions.
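
Written out by hand, the split could look like the sketch below. The name sum8 is hypothetical, and the two half-width array operations only mirror what a 128-bit target would have to do; nothing here forces the compiler to emit SSE:

```d
// Hypothetical helper: sum 8 floats as two 4-wide chunks,
// the way a 128-bit SSE target would have to split the work.
float[8] sum8(float[8] a, float[8] b)
{
    float[8] c;
    c[0 .. 4] = a[0 .. 4] + b[0 .. 4]; // lower half
    c[4 .. 8] = a[4 .. 8] + b[4 .. 8]; // upper half
    return c;
}
```

The point of the proposal is that the programmer writes only the 8-wide sum and the compiler picks the split.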

Currently such operations are implemented as calls to a function (which also tests whether, and which, vector instructions are available). That call slows the code down when you only have to sum 4 floats.

Another problem is that some important semantics are missing, for example shuffling and a few other things. With some care, most or all such operations (keeping a close eye on AVX too) can be mapped to built-in array methods...
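
To make "shuffling" concrete, here is a sketch of the semantics as an ordinary D template. The name shuffle and its signature are invented for this example, and no such built-in array method exists. The indices are compile-time constants, which is the information a compiler would need to pick a single shuffle instruction:

```d
// Hypothetical shuffle: element k of the result is v[ik],
// with all four indices known at compile time.
float[4] shuffle(size_t i0, size_t i1, size_t i2, size_t i3)(float[4] v)
{
    float[4] r = [v[i0], v[i1], v[i2], v[i3]];
    return r;
}
```

For example, shuffle!(3, 2, 1, 0)(v) reverses the four elements of v.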

The problem here is that you don't want to tie the D language too closely to the currently available vector instructions, because in 5-10 years CPUs may change. So what you want is to add enough semantics that the compiler can later compile the code as best it can (with scalar instructions, with SSE1, with a future 1024-bit-wide AVX, or with something unknown today). If the language doesn't give the compiler enough semantics, you are forced to do what GCC does: it now tries to infer vector operations from normal code, but that's a complex thing and usually not as efficient as using the GCC SSE intrinsics.
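
One way to picture code that isn't tied to a vector width is the sketch below. The function add is hypothetical; the element count N is part of the type, so the source states the whole operation and leaves the choice of scalar code, SSE, or something wider to the compiler:

```d
// Hypothetical helper: element-wise sum for any fixed length N.
// Nothing in the source mentions a register width.
float[N] add(size_t N)(float[N] a, float[N] b)
{
    float[N] c;
    c[] = a[] + b[]; // the whole N-element sum in one array operation
    return c;
}
```

The same source line covers the 4-float and the 8-float case; only the generated code would differ between targets.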

This is something that deserves a thread of its own here :-) In the end, implementing all this doesn't look hard. It's mostly a matter of designing it well (auto-vectorization as done in GCC is harder to implement).

Bye,
bearophile