SIMD support...

Sat Jan 7 04:45:14 PST 2012

On 01/07/12 04:27, Martin Nowak wrote:
> __v128 add(__v128 a, __v128 b) pure
> {
>     __v128 res = a;
>     asm (res, b)
>     {
>         ADD res, b;
>     }
>     return res;
> }

> This effectively achieves the same as writing this with intrinsics.
> It also greatly improves the composition of inline asm.

What it also does is allow mixing "ordinary" asm with the SIMD instructions. People will do that, because it's easier this way (less typing), and then the result is practically unportable, because every compiler would now have to fully understand and support that one asm variant.

If you do "__v128 __simd_add(__v128 a, __v128)" instead, you don't lose anything; in fact it could be internally implemented with your asm(). But now the "real" asm code is separate from the more generic (and sometimes even portable) simd ops -- the compiler does not need to understand asm() to be able to use it. It can still do every optimization it could do with the raw asm, and possibly more, as it knows exactly what's going on. The explicit pure annotations are not needed. The compiler has more freedom to choose better scheduling, ordering, sometimes instruction selection (if there's more than one alternative) and even various code transformations. Even CTFE works.
Consider the case when a lot of your above add()-like functions are inlined into another one, which will be a common pattern -- you don't want any false dependencies. (If you do care about exact instruction scheduling you're writing asm, not D, so for that case asm() is a better choice.)

I wrote "__v128 __simd_add(__v128 a, __v128)" above, but that was just to keep things simple. What you actually want is "vfloat4 __simd_add(vfloat4 a, vfloat4 b)" etc., i.e. strongly typed.
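
To make the typed variant concrete, here is a minimal library-level sketch. Everything in it is an assumption for illustration: vfloat4 and vint4 are plain structs rather than real vector types, and the function is spelled simd_add instead of __simd_add because identifiers with two leading underscores are reserved in D. The loop body is only a portable fallback; a real implementation would map it to a single SIMD add.

```d
// Hypothetical typed vector formats; plain structs stand in for real SIMD types.
struct vfloat4 { float[4] lanes; }
struct vint4   { int[4]   lanes; }

// Stand-in for the "__simd_add" of the text. The lane-wise loop is a
// portable fallback, not what a compiler-backed version would emit.
vfloat4 simd_add(vfloat4 a, vfloat4 b)
{
    vfloat4 r;
    foreach (i; 0 .. 4)
        r.lanes[i] = a.lanes[i] + b.lanes[i];
    return r;
}

// CTFE works: the call is evaluated at compile time.
enum total = simd_add(vfloat4([1f, 2f, 3f, 4f]), vfloat4([10f, 20f, 30f, 40f]));
static assert(total.lanes == [11f, 22f, 33f, 44f]);

// Strong typing: mixing formats does not compile.
static assert(!__traits(compiles, simd_add(vfloat4.init, vint4.init)));
```

Note that no pure annotation and no asm() appear anywhere in the caller-visible part; how the add is actually carried out stays an implementation detail.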

Whether this needs to go into the compiler itself depends on only one thing - whether it can be done efficiently in a library. Efficiently in this case means "zero-cost" or "free".

Having different static types (in addition to the untyped __v(64|128|256) ones) gives you not only safety (you don't accidentally end up operating on the wrong data/format because you forgot about some version() combination etc), but also allows things like overloading. Then you can write more generic code, which works with all available formats. And e.g. changing the precision used by some app module involves only changing a few declarations plus data entry/exit points, not modifying every single SIMD instruction.
Untyped __v128 only really works for memcpy() type functions; other than that it is mainly useful for conversions and passing data etc - the cases where you don't care about the content in transit.
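
A rough sketch of the overloading point, under the same assumptions as before: vfloat4 and vdouble2 are hypothetical plain structs, simd_add stands in for the typed add (each overload could map to a different instruction), and sum3 and vec are names made up for this example.

```d
// Hypothetical typed formats, modelled as plain structs for illustration.
struct vfloat4  { float[4]  lanes; }
struct vdouble2 { double[2] lanes; }

// One overload per format.
vfloat4 simd_add(vfloat4 a, vfloat4 b)
{
    foreach (i; 0 .. 4)
        a.lanes[i] += b.lanes[i];
    return a;
}

vdouble2 simd_add(vdouble2 a, vdouble2 b)
{
    foreach (i; 0 .. 2)
        a.lanes[i] += b.lanes[i];
    return a;
}

// Generic code never names a concrete format;
// overload resolution picks the matching add.
V sum3(V)(V a, V b, V c)
{
    return simd_add(simd_add(a, b), c);
}

// The precision used by this module lives in one declaration.
alias vec = vfloat4;

static assert(sum3(vec([1f, 2f, 3f, 4f]),
                   vec([1f, 1f, 1f, 1f]),
                   vec([2f, 2f, 2f, 2f])).lanes == [4f, 5f, 6f, 7f]);
static assert(sum3(vdouble2([1.0, 2.0]),
                   vdouble2([0.5, 0.5]),
                   vdouble2([0.25, 0.25])).lanes == [1.75, 2.75]);
```

Switching the module to doubles would mean changing the alias and the places where data enters and leaves; sum3 and anything built on it stays untouched. With untyped __v128 arguments there would be nothing for the overloads to dispatch on.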

>> What dmd does do with the inline assembler is it keeps track of which registers are read/written, so that effective register allocation can be done for the non-asm code.
> 
> Which is why the compiler should be the one to allocate pseudo-registers.

Yep.

artur