Implementation of gcc SIMD builtins

Thu Feb 10 10:40:42 PST 2011

Iain Buclaw Wrote:

> == Quote from Mike Farnsworth (mike.farnsworth at gmail.com)'s article
> > Sorry to start a new thread on this, but I didn't want it to get lost in
> > the middle of the previous comments.  I have the start of a working
> > implementation in gdc giving access to the __builtin_ia32_* functions.
> > The way I did it so far manages to compile down to very tight SSE code.
> > That's the good part.
> > The bad part is that there are a couple of limitations that engender
> > slightly ugly code at the moment.
> > Issue 1:
> > In order for the compiler to actually recognize the builtins when you
> > call them, I had to define a set of custom types that represent gcc's
> > types with the vector_size attribute that get passed into the builtins.
> > I couldn't use float[4] for V4SF as I (and Iain) had hoped, as I can't
> > set the alignment, and it automatically generates calls to _d_array_init
> > and _d_array_copy and such, rather than just staying in SSE
> > registers.  Let's just say the code was very non-optimal.
> 
> I didn't hope for anything, I'm not the crazy one using them. =)
> 
> > Instead, I create a struct declaration in gcc.builtins for all of the
> > types expected by those builtins, and I name them to match what they
> > would nominally contain: __v4sf would have 4 floats, __v32qi would have
> > 32 bytes, __v2df would have 2 doubles, and so forth.  Each struct has
> > 16-byte alignment and the correct size.  But here's the rub: they have
> > no fields in them.  I tried my darndest to add VarDeclarations to them,
> > but because the actual gcc tree type wasn't a struct, the compiler would
> > just ICE when instantiating any of those structs.
> > I'd like to fix this, so that you can literally access the contents of
> > the struct like it had a float[4], or a double[2], or a byte[32], or
> > whatever it should actually have; or instead it should give you an
> > overloaded [] operator for direct indexing.
> > Issue 2:
> > The builtin structs I generate are *not* recognized by the frontend as
> > having support for +, -, *, /, etc like the gcc vector_size types
> > automatically do in C and C++.  I might be able to add those and have
> > them contain code to drop into the builtins.  For now, you *must* use
> > the builtin functions to perform operations on these types.  I'm
> > obviously aiming to use the builtin functions, myself (for now).
> 
> Actually, the more I think about it, the more I feel a user-defined union would be
> better for working around the shortcomings of gcc attribute support in gdc. And trying to
> use whatever builtins gcc has to offer won't get you very far anytime soon.
> 
> There are one or two ICEs when using arithmetic operations (+,-,/,*,=) on typedef'd
> types with vector attributes assigned to them. This has mostly been fixed in my
> local tree (hopefully with kind error messages for invalid ops too), which will be
> pushed soon after the next dmd release merge.

I tried to use the typedef'd types, but even with simple ops they gave me all sorts of weird compile errors as soon as I tried to pass them to the intrinsics.  In other words, with the typedef'd types I can get the +, -, *, /, ... operators to work but not the builtin functions, while with my builtin structs I can get the builtin functions to work but not the operators.  I'm hoping at some point I can get the best of both worlds, but I still feel pretty lost in all the gdc code (although I'm slowly learning).

> > Quick example:
> > ///// File VectorsMain.d /////
> > import gcc.builtins;
> > import mmintrins;
> > import std.stdio;
> > void main()
> > {
> >     __v4sf bv1;
> >     setvelem(bv1, 0, 1.0f);
> >     setvelem(bv1, 1, 2.0f);
> >     setvelem(bv1, 2, 3.0f);
> >     setvelem(bv1, 3, 0.0f);
> >     __v4sf bv2;
> >     setvelem(bv2, 0, 1.0f);
> >     setvelem(bv2, 1, 1.0f);
> >     setvelem(bv2, 2, 1.0f);
> >     setvelem(bv2, 3, 0.0f);
> >     __v4sf bv3 = _mm_add_ps(bv1, bv2);
> >     std.stdio.writefln("Result: (%s, %s, %s, %s)",
> >                        velem!float(bv3, 0),
> >                        velem!float(bv3, 1),
> >                        velem!float(bv3, 2),
> >                        velem!float(bv3, 3));
> > }
> > ///// File mmintrins.d /////
> > module mmintrins;
> > import gcc.builtins;
> > T velem(T, VT)(VT vector, uint elem)
> > {
> >     return (cast(T*) &vector)[elem];
> > }
> > void setvelem(T, VT)(ref VT vector, uint elem, T value)
> > {
> >     (cast(T*) &vector)[elem] = value;
> > }
> > //pragma(set_attribute, _mm_add_ps, always_inline, artificial);
> > T _mm_add_ps(T)(const(T) v1, const(T) v2)
> > {
> >     return __builtin_ia32_addps(v1, v2);
> > }
> > ///// End example /////
> > Note a few things: I made _mm_add_ps templated on vector type (I'll
> > constrain it eventually to appropriate types), and this solves a couple
> > of problems: cross-module inlining works as the other module gets the
> > whole definition, and you can technically addps types other than v4sf.
> > Note the velem and setvelem methods are just to add a pretty face on the
> > fact that the data of the struct is hidden, with no fields to access it.
> >  More checks are needed (at least in debug mode), and there will be some
> > other handy things like _mm_set1_ps and _mm_set_ps to make rapid setup
> > of vectors easier.  I'll admit that this part is a bit ugly, but it
> > works, and it generates excellent code.  I compared the actual assembly
> > generated to my own C++ code with the same intrinsics, and so far the D
> > side is keeping up.
> > Please don't collectively throw up when you see this...fast vector ops
> > are kind of a big deal for me, so be gentle. =)  What do you all think?
> > -Mike
> 
> I think I'm gonna throw up... :~)

Well, keep in mind that I still have a few things to do in the near term that should help:

1) Add at minimum an overloaded [] op to allow direct indexing into the vector structs, which should make the velem/setvelem crap unnecessary.

2) Add a bunch more intrinsics wrappers that follow the Intel standard, so the utility of this goes up.

3) Add some example wrapper structs that define all of the relevant operators, dot/cross products, etc.  These will (at least for typical usage) hide all of the ugliness and hopefully will compile down to very good code, while still being proper D types that are easy to understand and use.
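
To give a rough idea of the shape items 1 and 3 could take, here is a minimal sketch, not the actual gdc work.  Everything in it is an assumption made for illustration: Vec4, splat and dot are made-up names, and the storage is a plain float[4] standing in for the builtin __v4sf, so the operators are ordinary element-wise loops rather than calls into __builtin_ia32_*.  It only shows the user-facing surface (indexing, arithmetic operators, a dot product) and says nothing about the code the compiler would generate.

```d
import std.stdio;

/// Hypothetical wrapper type. A plain float[4] stands in for the
/// builtin __v4sf, so nothing here touches the gcc builtins.
struct Vec4
{
    align(16) float[4] data = [0.0f, 0.0f, 0.0f, 0.0f];

    this(float x, float y, float z, float w)
    {
        data = [x, y, z, w];
    }

    /// Same value in every lane, in the spirit of _mm_set1_ps
    /// (assumed helper name, not an existing API).
    static Vec4 splat(float v)
    {
        return Vec4(v, v, v, v);
    }

    /// Read one element: v[i]
    float opIndex(size_t i) const
    {
        return data[i];
    }

    /// Write one element: v[i] = value
    float opIndexAssign(float value, size_t i)
    {
        return data[i] = value;
    }

    /// Element-wise +, -, * and /
    Vec4 opBinary(string op)(const Vec4 rhs) const
        if (op == "+" || op == "-" || op == "*" || op == "/")
    {
        Vec4 result;
        foreach (i; 0 .. 4)
            result.data[i] = mixin("data[i] " ~ op ~ " rhs.data[i]");
        return result;
    }
}

/// Four-component dot product, written against the [] operator.
float dot(const Vec4 a, const Vec4 b)
{
    float sum = 0.0f;
    foreach (i; 0 .. 4)
        sum += a[i] * b[i];
    return sum;
}

void main()
{
    // Same inputs as the quoted VectorsMain.d example.
    auto v1 = Vec4(1.0f, 2.0f, 3.0f, 0.0f);
    auto v2 = Vec4.splat(1.0f);
    v2[3] = 0.0f;

    auto v3 = v1 + v2;
    writefln("Result: (%s, %s, %s, %s)", v3[0], v3[1], v3[2], v3[3]);
    writefln("Dot: %s", dot(v1, v2));
}
```

The idea would be that the body of opBinary eventually forwards to the matching builtin wrapper (_mm_add_ps and friends) once the struct types and the operators can coexist, while user code keeps looking like the main() above.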

Hopefully then nobody will want to throw up anymore.

-Mike