__restrict, architecture intrinsics vs asm, consoles, and other

Marco Leise Marco.Leise at gmx.de
Thu Sep 22 11:19:58 PDT 2011


On 22.09.2011 at 19:26, Peter Alexander
<peter.alexander.au at gmail.com> wrote:

> On 22/09/11 7:39 AM, Don wrote:
>> On 22.09.2011 05:24, a wrote:
>>> How would one do something like this without intrinsics (the code is
>>> c++ using
>>> gcc vector extensions):
>>
>> [snip]
>> At present, you can't do it without ultimately resorting to inline asm.
>> But, what we've done is to move SIMD into the machine model: the D
>> machine model assumes that float[4] + float[4] is a more efficient
>> operation than a loop.
>> Currently, only arithmetic operations are implemented, and on DMD at
>> least, they're still not proper intrinsics. So in the long term it'll be
>> possible to do it directly, but not yet.
>>
>> At various times, several of us have implemented 'swizzle' using CTFE,
>> giving you a syntax like:
>>
>> float[4] x, y;
>> x[] = y[].swizzle!"cdcd"();
>> // x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3]
>>
>> which compiles to a single shufps instruction.
>
> How can it compile into a single shufps? x and y would need to already  
> be in vector registers, and unless I've missed something, they won't be.  
> You'll need instructions for loading into registers (using the slow  
> movups because 16-byte alignment isn't guaranteed) then do the shufps,  
> then load back out again.
>
> This is too slow for performance critical code.
>
> Being stored in XMM registers from creation, passed and returned in XMM  
> registers to/from functions is a key requirement for this sort of code.  
> If you have to keep loading in and out of memory then you lose all  
> performance.

I thought about this. Either write long functions, so that you don't have to
load and store often, or have functions simply assume that their vector
parameters are already in registers, without any explicit declaration.
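
To make the difference concrete, here is a rough sketch in C++ with SSE intrinsics (not D, and not what any D compiler emits today). The names swizzle_cdcd and swizzle_cdcd_mem are made up for illustration, and the scalar branch is only there so the sketch also builds on non-x86 targets. The first function takes and returns the vector by value, which on the usual x86-64 calling conventions typically keeps it in an XMM register; the second is the memory-to-memory form that needs the extra unaligned load and store around the shufps.

```cpp
#include <cassert>

#if defined(__SSE__) || defined(_M_X64)
#define SWIZZLE_HAVE_SSE 1
#include <xmmintrin.h>

// Register form (hypothetical name): vector in, vector out.
// Computes {v[2], v[3], v[2], v[3]}, i.e. the "cdcd" swizzle, with one shufps.
static inline __m128 swizzle_cdcd(__m128 v) {
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 2, 3, 2));
}
#endif

// Memory form (hypothetical name): x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3].
// No alignment is assumed, so the SSE path uses unaligned load/store.
void swizzle_cdcd_mem(const float* y, float* x) {
    assert(y != nullptr && x != nullptr);
#if defined(SWIZZLE_HAVE_SSE)
    __m128 v = _mm_loadu_ps(y);   // unaligned load from memory
    v = swizzle_cdcd(v);          // the actual shuffle
    _mm_storeu_ps(x, v);          // unaligned store back to memory
#else
    // Scalar fallback for targets without SSE; temporaries allow x == y.
    const float c = y[2];
    const float d = y[3];
    x[0] = c; x[1] = d; x[2] = c; x[3] = d;
#endif
}
```

In a long function you would load once, chain calls like swizzle_cdcd on the register value, and store once at the end, which is the first option above. The second option amounts to the compiler treating every float[4] parameter the way __m128 is treated here.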

