SIMD/intrinsics questions

Sun Nov 8 22:53:11 PST 2009

On 11/08/2009 06:35 PM, Robert Jacques wrote:
> On Sun, 08 Nov 2009 17:47:31 -0500, Lutger
> <lutger.blijdestijn at gmail.com> wrote:
>
>> Mike Farnsworth wrote:
>>
>> ...
>>>
>>> Of course, there are some operations that the available SSE intrinsics
>>> cover that the compiler can't expose via the typical operators, so those
>>> still need to be supported somehow. Does anyone know if ldc or dmd has
>>> those, or if they'll optimize away SSE loads and stores if I roll my own
>>> structs with asm blocks? I saw from the ldc source it had the usual llvm
>>> intrinsics, but as far as hardware-specific codegen intrinsics I
>>> couldn't spot any.
>>>
>>> Thanks,
>>> Mike Farnsworth
>>>
>>
>> Have you seen this page?
>> http://www.dsource.org/projects/ldc/wiki/InlineAsmExpressions
>>
>> This is similar to gcc's (gdc has it too) extended inline asm
>> expressions. I'm not at all in the know about all this, but I think
>> this will allow you to build something yourself that works well with
>> the optimizations done by the compiler. If someone could clarify how
>> these inline expressions work exactly, that would be great.
>
> SSE intrinsics allow you to specify the operation, but allow the
> compiler to do the register assignments, inlining, etc. D's inline asm
> requires the programmer to manage everything.

I finally went and did a little homework, so sorry for the long reply 
that follows.

I have been experimenting with both the ldc.llvmasm.__asm() function and
D's asm {} blocks to get what I want.  So far, I have been able to get
some SSE instructions in there, but I'm running into a few issues.  For
now, I'm only using ldc, but I'll try out dmd eventually as well.

* Using "-release -O5 -enable-inlining" in ldc, I can't for the life of 
me get it to inline the functions with the SSE asm statements.

* Overriding opAdd for a struct, I had a hard time getting it to not
spit out what appears to me to be a lot of extra loading / stack code.
In order to even get it to do what I wanted, I wrote it like this:

     Vector opAdd(Vector v)
     {
         Vector result = void;
         float* c0 = &c[0];
         float* vc0 = &v.c[0];
         float* rc0 = &result.c[0];
         asm
         {
             movaps XMM0,c0 ;
             movaps XMM1,vc0 ;
             addps XMM0,XMM1 ;
             movaps rc0,XMM0 ;
         }
         return result;
     }

And that ended up with the address-of code and stack stuff that isn't 
optimal.
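
For comparing generated code, it can help to have a version with no asm
at all as a baseline.  What follows is only a sketch under a few
assumptions: it uses D's element-wise array operations instead of
explicit SSE instructions, it spells the overload as opBinary (the
current D2 hook; opAdd above is the older form), and whether a given
compiler turns the array operation into a single addps is not
guaranteed.

```d
// Baseline sketch: same Vector layout as in the post, no inline asm.
align(16) struct Vector
{
    float[4] c;

    // opBinary!"+" is the current D2 spelling of the older opAdd.
    Vector opBinary(string op : "+")(Vector v) const
    {
        Vector result = void;        // skip the default (NaN) initialization
        result.c[] = c[] + v.c[];    // element-wise array operation
        return result;
    }
}

void main()
{
    auto a = Vector([1.0f, 2.0f, 3.0f, 4.0f]);
    auto b = Vector([10.0f, 20.0f, 30.0f, 40.0f]);
    auto r = a + b;
    assert(r.c == [11.0f, 22.0f, 33.0f, 44.0f]);
}
```

Because there is no asm block, the optimizer is free to inline this and
pick registers itself, which makes it a useful reference point when
looking at the disassembly of the asm version.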

* When I instead write a function like this:

     static void vecAdd(ref Vector v1, ref Vector v2, ref Vector result)
     {
         asm
         {
             movaps XMM0,v1 ;
             movaps XMM1,v2 ;
             addps XMM0,XMM1 ;
             movaps result,XMM0 ;
         }
     }

where Vector is defined as:

     align(16) struct Vector
     {
     public:
         float[4] c;
     }

(Note that 'result' is passed as 'ref' and not 'out'.  With 'out', it 
inserted init code in there, probably because the compiler thought I 
hadn't actually touched the result, even though the assembly did its 
job.  'out' is a better contract description, so it'd be nice to know 
how to suppress that.)
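
On the 'out' behaviour: the init code is most likely not the compiler
misjudging the asm.  D defines an 'out' parameter as being set to its
type's default initializer on function entry, whatever the body does.
A small sketch of the difference (touchOut and touchRef are made-up
names for illustration):

```d
import std.math : isNaN;

struct Vector
{
    float[4] c;
}

// 'out': r is reset to Vector.init (every float is NaN) on entry.
void touchOut(out Vector r) {}

// 'ref': r is passed through untouched.
void touchRef(ref Vector r) {}

void main()
{
    Vector v;
    v.c[] = 1.0f;

    touchRef(v);
    assert(v.c[0] == 1.0f);   // still holds the caller's data

    touchOut(v);
    assert(isNaN(v.c[0]));    // reset to float.init, which is NaN
}
```

So with 'out' the extra stores are part of the parameter's contract, and
'ref' is the way to avoid them when the body fills in every element.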

With this I get fewer instructions in the function, but it still has
an extraneous stack push/pop pair surrounding it, and it still won't
inline for me where I call it.  It's all of 8 instructions including the
return, and any inlining scheme that thinks that merits a function call
instead ought to be dragged out and shot. =P

* I tried __asm(T)(char[], char[], T) from ldc as well, but it just
wouldn't take: either I suck at getting LLVM to recognize my
constraints, or ldc doesn't support SSE constraints yet.  I ended up
going the D asm block route once I figured out how to give it addresses
without taking the address of everything (using ref for struct
arguments works great!).

So, yeah, once I can figure out how to get any of the compilers to 
inline my asm-laced functions, and then figure out how to get an 
optimizer to eliminate all the (what should be) extraneous movaps 
instructions, then I'll be in good shape.  Until then, I won't port my 
ray tracer over to D.  But I will be happy to try to help out with 
patches/experiments until then to get to the goal of making D suitable 
for heavy SIMD calculations.  I'm talking with the ldc guys about it, as 
LLVM should be able to make really good use of this stuff (especially 
intrinsics) once the frontend can hand it off suitably.
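
For a picture of what "handing it off suitably" could look like, here is
a sketch that assumes a later compiler than the ones discussed in this
thread: druntime subsequently gained a core.simd module with built-in
vector types such as float4.  This is not something the compilers in
this post offer, and it needs a target where float4 is available (for
dmd, typically x86_64).

```d
import core.simd;

// The vector type is known to the compiler, so it handles register
// allocation and inlining; on SSE targets this is usually one addps.
float4 vecAdd(float4 a, float4 b)
{
    return a + b;
}

void main()
{
    float4 x = [1.0f, 2.0f, 3.0f, 4.0f];
    float4 y = [10.0f, 20.0f, 30.0f, 40.0f];
    float4 r = vecAdd(x, y);
    assert(r.array == [11.0f, 22.0f, 33.0f, 44.0f]);
}
```

The point of the sketch is the division of labour described earlier in
the thread: the code names the operation, and the compiler decides the
registers and the loads and stores.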

I'm excited to work on a project like this, because if I get better at 
dealing with SIMD issues in the compiler I should be able to capitalize 
on it to make my math-heavy code even faster.  Mmmm...speed...

-Mike