SIMD/intrinsics questions
Robert Jacques
sandford at jhu.edu
Sun Nov 8 23:28:42 PST 2009
On Mon, 09 Nov 2009 01:53:11 -0500, Michael Farnsworth
<mike.farnsworth at gmail.com> wrote:
> On 11/08/2009 06:35 PM, Robert Jacques wrote:
>> On Sun, 08 Nov 2009 17:47:31 -0500, Lutger
>> <lutger.blijdestijn at gmail.com> wrote:
>>
>>> Mike Farnsworth wrote:
>>>
>>> ...
>>>>
>>>> Of course, there are some operations that the available SSE intrinsics
>>>> cover that the compiler can't expose via the typical operators, so those
>>>> still need to be supported somehow. Does anyone know if ldc or dmd has
>>>> those, or if they'll optimize away SSE loads and stores if I roll my own
>>>> structs with asm blocks? I saw from the ldc source it had the usual llvm
>>>> intrinsics, but as far as hardware-specific codegen intrinsics I couldn't
>>>> spot any.
>>>>
>>>> Thanks,
>>>> Mike Farnsworth
>>>>
>>>
>>> Have you seen this page?
>>> http://www.dsource.org/projects/ldc/wiki/InlineAsmExpressions
>>>
>>> This is similar to gcc's (gdc has it too) extended inline asm expressions.
>>> I'm not at all in the know about all this, but I think this will allow you
>>> to build something yourself that works well with the optimizations done by
>>> the compiler. If someone could clarify how these inline asm expressions
>>> work exactly, that would be great.
>>
>> SSE intrinsics let you specify the operation while leaving register
>> assignment, inlining, etc. to the compiler. D's inline asm requires the
>> programmer to manage everything.
>
> I finally went and did a little homework, so sorry for the long reply
> that follows.
>
> I have been experimenting with both the ldc.llvmasm.__asm() function, as
> well as getting D's asm {} to do what I want. So far, I have been able
> to get some SSE instructions in there, but I'm running into a few
> issues. For now, I'm only using ldc, but I'll try out dmd eventually as
> well.
>
>
> * Using "-release -O5 -enable-inlining" in ldc, I can't for the life of
> me get it to inline the functions with the SSE asm statements.
>
>
> * Overriding opAdd for a struct, I had a hard time getting it not to spit
> out what looks to me like a lot of extra loading/stack code. In order to
> even get it to do what I wanted, I wrote it like this:
>
> Vector opAdd(Vector v)
> {
>     Vector result = void;
>     float* c0 = &c[0];
>     float* vc0 = &v.c[0];
>     float* rc0 = &result.c[0];
>     asm
>     {
>         movaps XMM0, c0;
>         movaps XMM1, vc0;
>         addps  XMM0, XMM1;
>         movaps rc0, XMM0;
>     }
>     return result;
> }
>
> And that ended up with the address-of code and stack stuff that isn't
> optimal.
>
>
> * When I instead write a function like this:
>
> static void vecAdd(ref Vector v1, ref Vector v2, ref Vector result)
> {
>     asm
>     {
>         movaps XMM0, v1;
>         movaps XMM1, v2;
>         addps  XMM0, XMM1;
>         movaps result, XMM0;
>     }
> }
>
> where Vector is defined as:
>
> align(16) struct Vector
> {
> public:
>     float[4] c;
> }
>
> (Note that 'result' is passed as 'ref' and not 'out'. With 'out', it
> inserted init code in there, probably because the compiler thought I
> hadn't actually touched the result, even though the assembly did its
> job. 'out' is a better contract description, so it'd be nice to know
> how to suppress that.)
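
That init code is the defined behaviour of 'out': an out parameter is reset to
its .init value on function entry, while ref leaves the caller's data alone.
A minimal sketch of the difference in plain D2 (placeholder bodies, nothing
SSE-specific):

align(16) struct Vector
{
    float[4] c;
}

// 'out': the compiler re-initializes result to Vector.init (all NaNs for a
// float[4]) before the body runs -- that's the extra init code.
void viaOut(out Vector result) { /* asm would fill result here */ }

// 'ref': the caller's contents are left untouched on entry; no init code.
void viaRef(ref Vector result) { /* asm would fill result here */ }

void main()
{
    Vector v = Vector([1, 2, 3, 4]);
    viaRef(v);   // v.c is still [1, 2, 3, 4]
    viaOut(v);   // v.c is now [nan, nan, nan, nan]
}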
>
> With this I get fewer instructions in the function, but it still has an
> extraneous stack push/pop pair surrounding it, and it still won't inline
> for me where I call it. It's all of 8 instructions including the return,
> and any inlining scheme that thinks that merits a function call instead
> ought to be dragged out and shot. =P
>
>
> * I used __asm(T)(char[], char[], T) from ldc as well, but either I suck
> at getting LLVM to recognize my constraints or ldc doesn't support SSE
> constraints yet; either way, it just wouldn't take. I ended up going the
> D asm block route once I figured out how to give it addresses without taking
> the address of everything (using ref for struct arguments works great!).
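
The constraint string is the part that lets the compiler do the register
assignment: '=' marks an output, "r" asks for some general-purpose register,
"x" for some SSE register, and a digit ties an operand to an earlier one. A
rough sketch of the shape this takes with ldc's __asm; the exact signature and
constraint spellings here are from memory and may differ between ldc versions:

import ldc.llvmasm;

float twice(float a)
{
    // LLVM-style template: $0 is operand 0 (the output). The constraint
    // string "=x,0" says: result in some XMM register, and the input 'a'
    // tied to that same register -- which register is the optimizer's call.
    return __asm!(float)("addss $0, $0", "=x,0", a);
}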
>
>
> So, yeah, once I can figure out how to get any of the compilers to inline
> my asm-laced functions, and then figure out how to get an optimizer to
> eliminate all the (what should be) extraneous movaps instructions, I'll be
> in good shape. Until then, I won't port my ray tracer over to D, but I'll
> be happy to help out with patches/experiments to get to the goal of making
> D suitable for heavy SIMD calculations. I'm talking with the ldc guys about
> it, as LLVM should be able to make really good use of this stuff (especially
> intrinsics) once the frontend can hand it off suitably.
>
> I'm excited to work on a project like this, because if I get better at
> dealing with SIMD issues in the compiler I should be able to capitalize
> on it to make my math-heavy code even faster. Mmmm...speed...
>
> -Mike
By design, D asm blocks are walled off from the optimizer: no code motion,
etc. occurs across them. D2 just changed fixed-size arrays into value types,
which provides most of the functionality of a small vector struct. However,
actual SSE optimization of these types will probably have to wait for x64
support, since a number of 32-bit chips don't support SSE.
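
For example, in plain D2 (nothing SSE-specific yet), fixed-size arrays copy by
value and support array-wise operations, which covers most of what a small
vector struct is used for; whether the backend turns the array ops into SSE is
the part that's still pending:

void main()
{
    float[4] a = [1, 2, 3, 4];
    float[4] b = [5, 6, 7, 8];
    float[4] c;

    c[] = a[] + b[];   // element-wise add; a natural candidate for SSE codegen
    c[] *= 2.0f;       // element-wise scale

    float[4] d = a;    // value semantics: d is an independent copy
    d[0] = 42;         // does not touch a
}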
P.S. For what it's worth, I do research that involves volumetric ray-tracing,
and I've always found memory, not computation, to be the bottleneck. Also,
why not look into CUDA/OpenCL/DirectCompute?