__restrict, architecture intrinsics vs asm, consoles, and other

Thu Sep 22 10:26:47 PDT 2011

On 22/09/11 7:39 AM, Don wrote:
> On 22.09.2011 05:24, a wrote:
>> How would one do something like this without intrinsics (the code is
>> c++ using
>> gcc vector extensions):
>
> [snip]
> At present, you can't do it without ultimately resorting to inline asm.
> But, what we've done is to move SIMD into the machine model: the D
> machine model assumes that float[4] + float[4] is a more efficient
> operation than a loop.
> Currently, only arithmetic operations are implemented, and on DMD at
> least, they're still not proper intrinsics. So in the long term it'll be
> possible to do it directly, but not yet.
>
> At various times, several of us have implemented 'swizzle' using CTFE,
> giving you a syntax like:
>
> float[4] x, y;
> x[] = y[].swizzle!"cdcd"();
> // x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3]
>
> which compiles to a single shufps instruction.

How can it compile into a single shufps? x and y would need to already
be in vector registers, and unless I've missed something, they won't be.
You'll need instructions to load them into registers (using the slow
movups, because 16-byte alignment isn't guaranteed), then do the shufps,
then store the result back out to memory.
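
To make that sequence concrete, here is a rough sketch in C++ with SSE intrinsics rather than D. The name swizzle_cdcd is made up for illustration, and the exact instructions emitted will depend on the compiler and flags; this only shows the load / shuffle / store shape I'd expect when x and y live in memory as plain float[4].

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 / x86-64 only)

// Hypothetical helper, not from any library.
// Computes x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3]
// for arrays with no alignment guarantee.
void swizzle_cdcd(float* x, const float* y) {
    // Unaligned load from memory (typically a movups).
    __m128 v = _mm_loadu_ps(y);
    // The shuffle itself (shufps): lanes 2,3,2,3 of v.
    v = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 2, 3, 2));
    // Unaligned store back to memory (typically another movups).
    _mm_storeu_ps(x, v);
}
```

So the shufps is only the middle step; the memory traffic on either side is what I'm worried about.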

This is too slow for performance-critical code.

Having values live in XMM registers from creation, and passed to and
returned from functions in XMM registers, is a key requirement for this
sort of code. If you have to keep moving data in and out of memory, you
lose all the performance benefit.
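
For comparison, a sketch of the register-resident style, again in C++ with SSE intrinsics. The function names are hypothetical, and whether __m128 arguments really travel in XMM registers is an assumption that depends on the calling convention and on inlining (the x86-64 System V ABI passes them in registers; other ABIs may not).

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 / x86-64 only)

// Hypothetical helpers. Values are a vector type from creation,
// so nothing here forces a round trip through memory.
inline __m128 swizzle_cdcd_reg(__m128 v) {
    // Lanes 2,3,2,3 of v; a single shufps when v is already in a register.
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 2, 3, 2));
}

inline __m128 add_swizzled(__m128 a, __m128 b) {
    // Takes and returns vectors by value; no float[4] in between.
    return _mm_add_ps(swizzle_cdcd_reg(a), swizzle_cdcd_reg(b));
}
```

Here the only loads and stores are wherever the data first enters or finally leaves the vector type.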