Implementation of gcc SIMD builtins

Mike Farnsworth mike.farnsworth at gmail.com
Thu Feb 10 01:00:08 PST 2011


Sorry to start a new thread on this, but I didn't want it to get lost in
the middle of the previous comments.  I have the start of a working
implementation in gdc giving access to the __builtin_ia32_* functions.
The way I did it so far manages to compile down to very tight SSE code.
That's the good part.

The bad part is that there are a couple of limitations that engender
slightly ugly code at the moment.

Issue 1:

In order for the compiler to actually recognize the builtins when you
call them, I had to define a set of custom types that represent gcc's
types with the vector_size attribute that get passed into the builtins.

I couldn't use float[4] for V4SF as I (and Iain) had hoped: I can't set
its alignment, and the compiler automatically generates calls to
_d_array_init, _d_array_copy, and such, rather than just keeping things in
SSE registers.  Let's just say the code was very non-optimal.
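
For reference, the naive wrapper I mean is roughly this (illustrative
only; as described above, it can't be forced to 16-byte alignment and its
copies go through the runtime array helpers):

// Illustrative sketch: this is the version that does NOT stay in SSE
// registers.
struct NaiveV4SF
{
    float[4] data;    // only float alignment; copies hit _d_array_copy
}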

Instead, I create a struct declaration in gcc.builtins for all of the
types expected by those builtins, and I name them to match what they
would nominally contain: __v4sf would have 4 floats, __v32qi would have
32 bytes, __v2df would have 2 doubles, and so forth.  Each struct has
16-byte alignment and the correct size.  But here's the rub: they have
no fields in them.  I tried my darndest to add VarDeclarations to them,
but because the actual gcc tree type wasn't a struct, the compiler would
just ICE when instantiating any of those structs.
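
To give a feel for their shape, the declarations in gcc.builtins come out
roughly like this (a sketch only; the 16-byte alignment and the real size
live on the gcc tree side rather than being spelled out in D):

align(16) struct __v4sf  { }   // 16 bytes, nominally 4 floats
align(16) struct __v2df  { }   // 16 bytes, nominally 2 doubles
align(16) struct __v32qi { }   // 32 bytes, nominally 32 bytes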

I'd like to fix this, so that you can access the contents of the struct
as if it really had a float[4], a double[2], a byte[32], or whatever it
should actually contain; or, failing that, it should give you an
overloaded [] operator for direct indexing.
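
One possible shape for that, sketched purely as an illustration (not
something gcc.builtins provides today), is a thin wrapper with an
overloaded index operator built on the same reinterpreting cast the
example further down uses:

// Hypothetical wrapper type; the name and layout are illustrative.
struct Vec4f
{
    __v4sf data;

    float opIndex(size_t i) const
    {
        return (cast(const(float)*) &data)[i];
    }

    void opIndexAssign(float value, size_t i)
    {
        (cast(float*) &data)[i] = value;
    }
}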

Issue 2:

The builtin structs I generate are *not* recognized by the frontend as
supporting +, -, *, /, etc., the way the gcc vector_size types
automatically do in C and C++.  I might be able to add those operators
and have them lower to the builtins.  For now, you *must* use the builtin
functions to perform operations on these types.  I'm obviously aiming to
use the builtin functions myself (for now).
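
If I do add them, the same kind of wrapper sketched under Issue 1 could
grow operator overloads that simply forward to the builtins, along these
lines (a sketch only, not in the current branch):

// Hypothetical: forward '+' to the addps builtin via D2's opBinary hook.
struct Vec4f
{
    __v4sf data;

    Vec4f opBinary(string op : "+")(Vec4f rhs)
    {
        return Vec4f(__builtin_ia32_addps(data, rhs.data));
    }
}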

Quick example:

///// File VectorsMain.d /////

import gcc.builtins;
import mmintrins;
import std.stdio;

void main()
{
    __v4sf bv1;
    setvelem(bv1, 0, 1.0f);
    setvelem(bv1, 1, 2.0f);
    setvelem(bv1, 2, 3.0f);
    setvelem(bv1, 3, 0.0f);

    __v4sf bv2;
    setvelem(bv2, 0, 1.0f);
    setvelem(bv2, 1, 1.0f);
    setvelem(bv2, 2, 1.0f);
    setvelem(bv2, 3, 0.0f);

    __v4sf bv3 = _mm_add_ps(bv1, bv2);

    std.stdio.writefln("Result: (%s, %s, %s, %s)",
                       velem!float(bv3, 0),
                       velem!float(bv3, 1),
                       velem!float(bv3, 2),
                       velem!float(bv3, 3));
}

///// File mmintrins.d /////

module mmintrins;

import gcc.builtins;


T velem(T, VT)(VT vector, uint elem)
{
    return (cast(T*) &vector)[elem];
}

void setvelem(T, VT)(ref VT vector, uint elem, T value)
{
    (cast(T*) &vector)[elem] = value;
}


//pragma(set_attribute, _mm_add_ps, always_inline, artificial);
T _mm_add_ps(T)(const(T) v1, const(T) v2)
{
    return __builtin_ia32_addps(v1, v2);
}

///// End example /////

Note a few things: I made _mm_add_ps templated on the vector type (I'll
eventually constrain it to appropriate types), which solves a couple of
problems: cross-module inlining works because the importing module gets
the whole definition, and you can technically addps types other than
__v4sf.
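
For example, the eventual constraint could be as simple as an is() check
on the instantiating type (hypothetical; the exact predicate isn't
decided yet):

// Possible constrained form (sketch): only let the SSE single-precision
// vector type through.
T _mm_add_ps(T)(const(T) v1, const(T) v2)
    if (is(T == __v4sf))
{
    return __builtin_ia32_addps(v1, v2);
}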

Note that the velem and setvelem functions are just there to put a pretty
face on the fact that the struct's data is hidden, with no fields to
access it.  More checks are needed (at least in debug mode), and there
will be some
other handy things like _mm_set1_ps and _mm_set_ps to make rapid setup
of vectors easier.  I'll admit that this part is a bit ugly, but it
works, and it generates excellent code.  I compared the actual assembly
generated to my own C++ code with the same intrinsics, and so far the D
side is keeping up.
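
As a taste of what one of those helpers might look like on top of
setvelem (a sketch only; the final versions may do something smarter):

// Hypothetical convenience helper: broadcast one float into all four
// lanes of a __v4sf.
__v4sf _mm_set1_ps(float value)
{
    __v4sf v;
    foreach (uint i; 0 .. 4)
        setvelem(v, i, value);
    return v;
}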

Please don't collectively throw up when you see this... fast vector ops
are kind of a big deal for me, so be gentle. =)  What do you all think?

-Mike

