System programming in D (Was: The God Language)
Peter Alexander
peter.alexander.au at gmail.com
Thu Jan 5 15:10:13 PST 2012
On 5/01/12 7:41 PM, Sean Kelly wrote:
> On Jan 5, 2012, at 10:02 AM, Manu wrote:
>>
>> That said, this is just one of numerous issues myself and the OP raised. I don't know why this one became the most popular for discussion... my suspicion is that it's because this is the easiest of my complaints to dismiss and shut down ;)
>
> It's also about the only language change among the issues you mentioned. Most of the others are QOI issues for compiler vendors. What I've been curious about is if you really have a need for the performance that would be granted by these features, or if this is more of an idealistic issue.
It's not idealistic. For example, in my current project, I got a 3x
perf improvement by rewriting one function as a few hundred lines of
inline asm, purely to use SIMD instructions.
This is a nuisance because:
(a) It's hard to maintain. I have to thoroughly document which registers
I'm using for what, just so I don't forget.
(b) It's difficult to optimize further. I could squeeze more out of the
inline assembly with better instruction scheduling, but scheduling
naturally scrambles the organization of the code, which makes it a
maintenance nightmare.
(c) It's not cross-platform. Luckily, x86 and x86_64 are similar enough
that I can write the code once and patch up the differences with CTFE +
string mixins (roughly along the lines of the sketch below).
I know of other parts of my code that would benefit from SIMD, but it's
too much hassle to write and maintain the inline assembly.
If we had support for
align(16) float[4] a, b;
a[] += b[]; // addps on x86
Then that would solve a lot of problems, but only for "float-like"
operations (addition, multiplication, etc.). There's no obvious existing
syntax for things like shuffles, conversions, SIMD square roots, cache
control, etc. that would map naturally to SIMD instructions.
Also, there's no way to tell the compiler whether you want to treat a
float[4] as an array or as a vector. Vectors are suited to data-parallel
execution, whereas arrays are suited to indexing. If the compiler makes
the wrong decision, you suffer heavily.
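To illustrate (just a sketch): the same float[4] often gets used both ways
in a single function, so whichever representation the compiler picks is
wrong for one of the uses.

float norm2(ref const float[4] a)
{
    float[4] sq;
    sq[] = a[] * a[];    // "vector" usage: data-parallel, ideally one mulps in an XMM register
    // "array" usage: scalar indexing, wants the elements addressable in memory
    return sq[0] + sq[1] + sq[2] + sq[3];
}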
Ideally, we'd introduce vector types, e.g. vec_float4, vec_int4,
vec_double2, etc.
These would map naturally to the CPU's vector registers and be aligned
appropriately for the target platform.
Elementary operations would match naturally and generate the code you
expect. Shuffling and other non-elementary operations would require the
use of intrinsics.
// 4 vector norms in parallel
vec_float4 xs, ys, zs, ws;
vec_float4 lengths = vec_sqrt(xs * xs + ys * ys + zs * zs + ws * ws);
On x86 w/SSE, this would ideally generate:
// assuming xs in xmm0, ys in xmm1 etc.
mulps xmm0, xmm0;
mulps xmm1, xmm1;
addps xmm0, xmm1;
mulps xmm2, xmm2;
addps xmm0, xmm2;
mulps xmm3, xmm3;
addps xmm0, xmm3;
sqrtps xmm0, xmm0;
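For the shuffles and friends, I'd imagine intrinsics taking compile-time
lane indices, something like this (names purely hypothetical):

vec_float4 v;                                    // (x, y, z, w)
vec_float4 rev   = vec_shuffle!(3, 2, 1, 0)(v);  // -> (w, z, y, x); one shufps/pshufd
vec_float4 splat = vec_shuffle!(0, 0, 0, 0)(v);  // broadcast x into all four lanes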
On platforms that don't support the vector types natively, there are two
options: (1) emit a compile error, or (2) compile anyway, replacing the
vector operations with scalar float ops.
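To be clear, by (2) I mean degrading the vector type to a plain float[4]
and doing the work one lane at a time, i.e. the norm example above would
effectively become (illustrative only):

import std.math : sqrt;

void norms(ref float[4] xs, ref float[4] ys, ref float[4] zs,
           ref float[4] ws, ref float[4] lengths)
{
    foreach (i; 0 .. 4)   // one scalar mul/add/sqrt per lane instead of the SSE versions
        lengths[i] = sqrt(xs[i]*xs[i] + ys[i]*ys[i] + zs[i]*zs[i] + ws[i]*ws[i]);
}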
I think introducing proper vector types is the only sensible way forward.