SIMD/intrinsics questions

Robert Jacques sandford at jhu.edu
Sun Nov 8 23:28:42 PST 2009


On Mon, 09 Nov 2009 01:53:11 -0500, Michael Farnsworth  
<mike.farnsworth at gmail.com> wrote:

> On 11/08/2009 06:35 PM, Robert Jacques wrote:
>> On Sun, 08 Nov 2009 17:47:31 -0500, Lutger
>> <lutger.blijdestijn at gmail.com> wrote:
>>
>>> Mike Farnsworth wrote:
>>>
>>> ...
>>>>
>>>> Of course, there are some operations that the available SSE intrinsics
>>>> cover that the compiler can't expose via the typical operators, so  
>>>> those
>>>> still need to be supported somehow. Does anyone know if ldc or dmd has
>>>> those, or if they'll optimize away SSE loads and stores if I roll my  
>>>> own
>>>> structs with asm blocks? I saw from the ldc source it had the usual  
>>>> llvm
>>>> intrinsics, but as far as hardware-specific codegen intrinsics I
>>>> couldn't
>>>> spot any.
>>>>
>>>> Thanks,
>>>> Mike Farnsworth
>>>>
>>>
>>> Have you seen this page?
>>> http://www.dsource.org/projects/ldc/wiki/InlineAsmExpressions
>>>
>>> This is similar to gcc's (gdc has it too) extended inline asm
>>> expressions.
>>> I'm not at all in the know about all this, but I think this will allow
>>> you
>>> to built something yourself that works well with the optimizations
>>> done by
>>> the compiler. If someone could clarify how these inline expressions  
>>> work
>>> exactly, that would be great.
>>
>> SSE intrinsics allow you to specify the operation, but allow the
>> compiler to do the register assignments, inlining, etc. D's inline asm
>> requires the programmer to manage everything.
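>>
>> For example (purely hypothetical -- neither dmd nor ldc exposes such an
>> intrinsic today), intrinsic-style code such as
>>
>>      result = addps(a, b);  // compiler assigns XMM registers, can inline and fold loads/stores
>>
>> leaves all of that to the compiler, whereas the asm-block equivalent has
>> to name the registers and spell out the loads and stores by hand.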
>
> I finally went and did a little homework, so sorry for the long reply  
> that follows.
>
> I have been experimenting both with the ldc.llvmasm.__asm() function and  
> with getting D's asm {} to do what I want.  So far, I have been able  
> to get some SSE instructions in there, but I'm running into a few  
> issues.  For now, I'm only using ldc, but I'll try out dmd eventually as  
> well.
>
>
> * Using "-release -O5 -enable-inlining" in ldc, I can't for the life of  
> me get it to inline the functions with the SSE asm statements.
>
>
> * Overloading opAdd for a struct, I had a hard time getting it not to  
> emit what looks to me like a lot of extra load / stack code.  In  
> order to even get it to do what I wanted, I wrote it like this:
>
>      Vector opAdd(Vector v)
>      {
>          Vector result = void;
>          float* c0 = &c[0];
>          float* vc0 = &v.c[0];
>          float* rc0 = &result.c[0];
>          asm
>          {
>              movaps XMM0,c0 ;
>              movaps XMM1,vc0 ;
>              addps XMM0,XMM1 ;
>              movaps rc0,XMM0 ;
>          }
>          return result;
>      }
>
> And that ended up with the address-of code and stack stuff that isn't  
> optimal.
>
>
> * When I instead write a function like this:
>
>      static void vecAdd(ref Vector v1, ref Vector v2, ref Vector result)
>      {
>          asm
>          {
>              movaps XMM0,v1 ;
>              movaps XMM1,v2 ;
>              addps XMM0,XMM1 ;
>              movaps result,XMM0 ;
>          }
>      }
>
> where Vector is defined as:
>
>      align(16) struct Vector
>      {
>      public:
>          float[4] c;
>      }
>
> (Note that 'result' is passed as 'ref' and not 'out'.  With 'out', it  
> inserted init code in there, because 'out' parameters are  
> default-initialized on function entry, even though the assembly does  
> all the writing itself.  'out' is a better contract description, so  
> it'd be nice to know how to suppress that init.)
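>
> For reference, a toy example of the 'ref' vs. 'out' difference as I  
> understand the spec (nothing SSE-specific here):
>
>      struct S { float x = 1.0f; }
>
>      void byRef(ref S s) { }   // leaves the caller's value untouched
>      void byOut(out S s) { }   // s is reset to S.init on entry
>
>      // S a; a.x = 5; byRef(a); assert(a.x == 5);
>      // S b; b.x = 5; byOut(b); assert(b.x == 1.0f);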
>
> With this I get fewer instructions in the function, but it still has  
> an extraneous stack push/pop pair surrounding it, and it still won't  
> inline where I call it.  It's all of 8 instructions including the  
> return, and any inlining scheme that thinks that merits a function call  
> instead ought to be dragged out and shot. =P
>
>
> * I used __asm(T)(char[], char[], T) from ldc as well, but either I suck  
> at getting LLVM to recognize my constraints or ldc doesn't support SSE  
> constraints yet; it just wouldn't take.  I ended up going the D asm  
> block route once I figured out how to give it addresses without taking  
> the address of everything (using ref for struct arguments works great!).
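>
> What I was attempting looked roughly like this (untested, and quite  
> possibly exactly where my constraints go wrong -- "x" being the  
> SSE-register constraint):
>
>      import ldc.llvmasm;
>
>      float ssqrt(float a)
>      {
>          // "=x": result in an SSE register, "x": input in an SSE register;
>          // LLVM does the register assignment for the expression.
>          return __asm!(float)("sqrtss $1, $0", "=x,x", a);
>      }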
>
>
> So, yeah, once I can figure out how to get any of the compilers to  
> inline my asm-laced functions, and then figure out how to get an  
> optimizer to eliminate all the (what should be) extraneous movaps  
> instructions, I'll be in good shape.  Until then, I won't port my  
> ray tracer over to D, but I'll be happy to help out with  
> patches and experiments in the meantime to get to the goal of making D  
> suitable for heavy SIMD calculations.  I'm talking with the ldc guys  
> about it, as LLVM should be able to make really good use of this stuff  
> (especially intrinsics) once the frontend can hand it off suitably.
>
> I'm excited to work on a project like this, because if I get better at  
> dealing with SIMD issues in the compiler I should be able to capitalize  
> on it to make my math-heavy code even faster.  Mmmm...speed...
>
> -Mike

By design, D asm blocks are separated from the optimizer: no code motion,  
etc. occurs across them. D2 just changed fixed-size arrays into value  
types, which provide most of the functionality of a small vector struct.  
However, actual SSE optimization of these types will probably have to  
wait for x64 support, since a number of 32-bit chips don't support SSE.
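For what it's worth, the value-type fixed-size arrays already let you  
express the operation at a high level and leave register assignment to  
the compiler -- something along these lines (a sketch; how good the  
generated code currently is, is another question):

     float[4] a = 1.0f, b = 2.0f, r;
     r[] = a[] + b[];   // array-wise add, no explicit registers or loads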

P.S. For what it's worth, I do research that involves volumetric  
ray-tracing, and I've always found memory to be the bottleneck in those  
computations. Also, why not look into CUDA/OpenCL/DirectCompute?


