SIMD support...
Martin Nowak
dawg at dawgfoto.de
Fri Jan 6 12:40:40 PST 2012
On Fri, 06 Jan 2012 20:00:15 +0100, Manu <turkeyman at gmail.com> wrote:
> On 6 January 2012 20:17, Martin Nowak <dawg at dawgfoto.de> wrote:
>
>> There is another benefit.
>> Consider the following:
>>
>> __vec128 addps(__vec128 a, __vec128 b) pure
>> {
>>     __vec128 res = a;
>>
>>     if (__ctfe)
>>     {
>>         foreach (i; 0 .. 4)
>>             res[i] += b[i];
>>     }
>>     else
>>     {
>>         asm (res, b)
>>         {
>>             addps res, b;
>>         }
>>     }
>>     return res;
>> }
>>
>
> You don't need to use inline ASM to be able to do this; it will work the
> same with intrinsics.
> I've detailed numerous problems with using inline asm, and complications
> with extending the inline assembler to support this.
>
Don't get me wrong here. The idea is to find out whether intrinsics
can be built with the help of inlineable asm functions.
CTFE support is one good reason to go with a library solution.
>>> * Assembly blocks present problems for the optimiser; it's not reliable
>>> that it can optimise around inline asm blocks. How bad will it be when
>>> trying to optimise around 100 small inlined functions, each containing
>>> its own inline asm block?
>>>
>> What do you mean by optimizing around? I don't see any apparent reason
>> why that should perform worse than using intrinsics.
>>
>
> Most compilers can't reschedule code around inline asm blocks. There
> are a lot of reasons for this; Google can help you.
> The main reason is that a COMPILER doesn't attempt to understand the
> assembly it's being asked to insert inline. The information that it may
> use
It doesn't have to understand the assembly.
Wrapping these in functions creates an IR expression with inputs and
outputs.
Declaring them as pure gives the compiler a free hand to apply whatever
optimizations it normally does on an IR tree: common subexpression
elimination, removing dead expressions...
> for optimisation is never present, so it can't do its job.
>
>
>> The only implementation issue could be that lots of inlined asm snippets
>> make plenty of basic blocks, which could slow down certain compiler
>> algorithms.
>
>
> Same problem as above. The compiler would need to understand enough about
> assembly to perform optimisation on the assembly itself to clean this up.
> Using intrinsics, all the register allocation, load/store code, etc., is
> all in the regular realm of compiling the language, and the code
> generation and optimisation will all work as usual.
>
There is no informational difference between the intrinsic

__m128 _mm_add_ps(__m128 a, __m128 b);

and an inline assembler version:

__m128 _mm_add_ps(__m128 a, __m128 b)
{
    asm
    {
        addps a, b;
    }
}
>>> * D's inline assembly syntax has to be carefully translated to GCC's
>>> inline asm format when using GCC, and this needs to be done
>>> PER-ARCHITECTURE, which Iain should not be expected to do for all the
>>> obscure architectures GCC supports.
>>>
>> ???
>> This would be needed for opcodes as well. Your initial goal was to
>> directly influence code gen down to the instruction level; how should
>> that be achieved without a platform-specific extension? Quite the
>> contrary: with ops and asm he will need two hack paths into gcc's
>> codegen.
>
>
>> What I see here is that we can do many good things for the inline
>> assembler while achieving the same goal.
>> With intrinsics on the other hand we're adding a very specialized
>> maintenance burden.
>
>
> You need to understand how the inline assembler works in GCC to
> understand the problems with this.
> GCC basically receives a string containing assembly code. It does not
> attempt to understand it; it just pastes it into the .s file verbatim.
> This means you can support any architecture without any additional work:
> you just type the appropriate architecture's asm in your program and it's
> fine. But now, if we want to perform pseudo-register assignment or
> parameter substitution, we need a front end that parses the D asm
> expressions and generates a valid asm string for GCC. It can't generate
> that string without detailed knowledge of the architecture it's
> targeting, and it's not feasible to implement that support for all the
> architectures GCC supports.
>
So the argument here is that intrinsics in D can be mapped more easily
to existing intrinsics in GCC?
I do understand that this will be pretty difficult for GDC
to implement.
It reminds me that Walter has stated several times how much
better an internal assembler can integrate with the language.
> Even after all that, it's still not ideal. Inline asm reduces the
> ability of the compiler to perform many optimisations.
>
>>> Consider this common situation and the code that will be built around
>>> it:
>>>
>>> __v128 vec = { floatX, floatY, floatZ, unsigned int packedColour }; //
>>>
>> Such is really not a good idea if the bit pattern of packedColour is a
>> denormal.
>> How can you even execute a single useful command on the floats here?
>>
>> Also mixing integer and FP instructions on the same register may
>> cause performance degradation. The registers are indeed typed
>> internally by the CPU.
>
>
> It's a very good idea; I am saving memory, and also saving memory
> accesses.
>
> This leads back to the point in my OP where I said that most games
> programmers turn NaN, Den, and FP exceptions off.
> As I've also raised before, most vectors are actually float[3]'s; W is
> usually ignored and contains rubbish.
> It's conventional to stash some 32-bit value in the W to fill the
> otherwise wasted space, and also get the load for free alongside the
> position.
>
> The typical program flow, in this case:
> * the colour will be copied out into a separate register where it will
> be reinterpreted as a uint, and have an unpack process applied to it.
> * XYZ will then be used to perform maths, ignoring W, which will
> continue to accumulate rubbish values... it doesn't matter, all FP
> exceptions and such are disabled.
Putting the uint in the front slot would make your life simpler then:
only a MOVD, no unpacking :).