SIMD/intrinsics questions
Michael Farnsworth
mike.farnsworth at gmail.com
Sun Nov 8 22:53:11 PST 2009
On 11/08/2009 06:35 PM, Robert Jacques wrote:
> On Sun, 08 Nov 2009 17:47:31 -0500, Lutger
> <lutger.blijdestijn at gmail.com> wrote:
>
>> Mike Farnsworth wrote:
>>
>> ...
>>>
>>> Of course, there are some operations that the available SSE intrinsics
>>> cover that the compiler can't expose via the typical operators, so those
>>> still need to be supported somehow. Does anyone know if ldc or dmd has
>>> those, or if they'll optimize away SSE loads and stores if I roll my own
>>> structs with asm blocks? I saw from the ldc source it had the usual llvm
>>> intrinsics, but as far as hardware-specific codegen intrinsics I couldn't
>>> spot any.
>>>
>>> Thanks,
>>> Mike Farnsworth
>>>
>>
>> Have you seen this page?
>> http://www.dsource.org/projects/ldc/wiki/InlineAsmExpressions
>>
>> This is similar to gcc's (gdc has it too) extended inline asm expressions.
>> I'm not at all in the know about all this, but I think this will allow you
>> to build something yourself that works well with the optimizations done by
>> the compiler. If someone could clarify how these inline expressions work
>> exactly, that would be great.
>
> SSE intrinsics allow you to specify the operation, but allow the
> compiler to do the register assignments, inlining, etc. D's inline asm
> requires the programmer to manage everything.
I finally went and did a little homework, so sorry for the long reply
that follows.

I have been experimenting with both the ldc.llvmasm.__asm() function, as
well as getting D's asm {} to do what I want. So far, I have been able
to get some SSE instructions in there, but I'm running into a few
issues. For now, I'm only using ldc, but I'll try out dmd eventually as
well.
* Using "-release -O5 -enable-inlining" in ldc, I can't for the life of
me get it to inline the functions with the SSE asm statements.
* Overriding opAdd for a struct, I had a hard time getting it not to
spit out what appears to me to be a lot of extra loading / stack code. In
order to even get it to do what I wanted, I wrote it like this:

    Vector opAdd(Vector v)
    {
        Vector result = void;       // left uninitialized; the asm block fills it
        float* c0 = &c[0];
        float* vc0 = &v.c[0];
        float* rc0 = &result.c[0];  // destination is result, not v
        asm
        {
            movaps XMM0,c0 ;
            movaps XMM1,vc0 ;
            addps XMM0,XMM1 ;
            movaps rc0,XMM0 ;
        }
        return result;
    }

And that ended up with the address-of code and stack stuff that isn't
optimal.
* When I instead write a function like this:

    static void vecAdd(ref Vector v1, ref Vector v2, ref Vector result)
    {
        asm
        {
            movaps XMM0,v1 ;
            movaps XMM1,v2 ;
            addps XMM0,XMM1 ;
            movaps result,XMM0 ;
        }
    }

where Vector is defined as:

    align(16) struct Vector
    {
    public:
        float[4] c;
    }

(Note that 'result' is passed as 'ref' and not 'out'. With 'out', it
inserted init code in there, probably because the compiler thought I
hadn't actually touched the result, even though the assembly did its
job. 'out' is a better contract description, so it'd be nice to know
how to suppress that.)
With this I get fewer instructions in the function, but it still has
an extraneous stack push/pop pair surrounding it, and it still won't
inline for me where I call it. It's all of 8 instructions including the
return, and any inlining scheme that thinks that merits a function call
instead ought to be dragged out and shot. =P
* I tried __asm(T)(char[], char[], T) from ldc as well, but it just
wouldn't take: either I suck at getting LLVM to recognize my constraints,
or ldc doesn't support SSE constraints yet. I ended up going the D asm
block route once I figured out how to give it addresses without taking
the address of everything (using ref for struct arguments works great!).
So, yeah, once I can figure out how to get any of the compilers to
inline my asm-laced functions, and then figure out how to get an
optimizer to eliminate all the (what should be) extraneous movaps
instructions, then I'll be in good shape. Until then, I won't port my
ray tracer over to D. But I will be happy to try to help out with
patches/experiments until then to get to the goal of making D suitable
for heavy SIMD calculations. I'm talking with the ldc guys about it, as
LLVM should be able to make really good use of this stuff (especially
intrinsics) once the frontend can hand it off suitably.
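
For comparison, here is a rough sketch of what the intrinsic-style version
of the add could look like. It assumes a D compiler that ships the
core.simd module, which postdates this thread, and a target where float4
is defined (for example x86_64). Vec4 is a hypothetical stand-in for the
Vector struct above, and opBinary is the D2 replacement for opAdd. The
point is only that the operation is named and register allocation is left
to the compiler; whether it actually inlines down to a single addps is up
to the compiler and flags used.

```d
import core.simd; // float4 is only defined on targets with SIMD support

// Hypothetical stand-in for the Vector struct discussed above.
struct Vec4
{
    float4 c; // the vector type carries its own 16-byte alignment

    // D2-style operator overload (D1 spelled this opAdd).
    Vec4 opBinary(string op)(Vec4 rhs) if (op == "+")
    {
        // No asm block and no explicit registers: the compiler chooses them.
        return Vec4(c + rhs.c);
    }
}

// Free-function form, mirroring vecAdd above but returning by value.
float4 vecAdd(float4 a, float4 b)
{
    return a + b;
}

unittest
{
    float4 x = [1.0f, 2.0f, 3.0f, 4.0f];
    float4 y = [10.0f, 20.0f, 30.0f, 40.0f];
    float[4] expected = [11.0f, 22.0f, 33.0f, 44.0f];

    assert(vecAdd(x, y).array == expected);
    assert((Vec4(x) + Vec4(y)).c.array == expected);
}
```

Returning by value rather than through a ref parameter sidesteps the
'out' versus 'ref' question entirely, since there is no result argument
for the compiler to initialize.
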
I'm excited to work on a project like this, because if I get better at
dealing with SIMD issues in the compiler I should be able to capitalize
on it to make my math-heavy code even faster. Mmmm...speed...
-Mike