core.simd woes

Tue Oct 2 03:49:08 PDT 2012

On Tuesday, 2 October 2012 at 08:17:33 UTC, Manu wrote:
> On 7 August 2012 16:56, jerro <a at a.com> wrote:
>
>>
>>  That said, almost all simd opcodes are directly accessible in 
>> std.simd.
>>> There are relatively few obscure operations that don't have a 
>>> representing
>>> function.
>>> The unpck/shuf example above for instance, they both 
>>> effectively perform a
>>> sort of swizzle, and both are accessible through swizzle!().
>>>
>>
>> They aren't. Swizzle only takes one argument, so you can't use 
>> it to select
>> elements from two vectors. Both unpcklps and shufps take two 
>> arguments.
>> Writing a swizzle with two arguments would be much harder.
>
>
> Any usages I've missed/haven't thought of; I'm all ears.

I don't think it is possible to think of all usages of this, but 
every simd instruction has valid usages. At least for writing 
pfft, I found shuffling two vectors very useful. For example, I 
needed a function that takes a small, square, power-of-two number 
of elements stored in vectors and bit-reverses them: it 
rearranges them so that you can calculate the new index of each 
element by reversing the bits of the old index (for 16 elements 
using 4 element vectors this can actually be done using 
std.simd.transpose, but for AVX it was more efficient to make 
this function work on 64 elements). There are other places in 
pfft where I need to select elements from two vectors (for 
example, here 
https://github.com/jerro/pfft/blob/sine-transform/pfft/avx_float.d#L141 
is the platform specific code for AVX).
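
To make the bit-reversal concrete, here is a minimal scalar sketch of 
the index mapping. It is Python rather than D, it is not SIMD code, 
and it is not how pfft implements it; the names bit_reverse_index and 
bit_reverse_permute are made up for this illustration.

```python
def bit_reverse_index(i, bits):
    """Reverse the lowest `bits` bits of the index i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)  # move the lowest bit of i onto the end of r
        i >>= 1
    return r


def bit_reverse_permute(elems):
    """Move each element to the index obtained by reversing the bits of
    its old index. The length must be a power of two."""
    n = len(elems)
    bits = n.bit_length() - 1
    if n != 1 << bits:
        raise ValueError("length must be a power of two")
    out = [None] * n
    for old, x in enumerate(elems):
        out[bit_reverse_index(old, bits)] = x
    return out


# 16 elements, i.e. four 4 element vectors laid out one after another.
print(bit_reverse_permute(list(range(16))))
```

Reversing the bits twice gives back the original index, so applying 
the permutation twice restores the original order.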

I don't think these are the kind of things that should be 
implemented in std.simd. If you wanted to implement all such 
operations (for example bit reversing a small array) that 
somebody may find useful at some time, std.simd would need to be 
huge, and most of it would never be used.

> I can imagine, I'll have a go at it... it's something I 
> considered, but not
> all architectures can do it efficiently.
> That said, a most-efficient implementation would probably still 
> be useful
> on all architectures, but for cross platform code, I usually 
> prefer to
> encourage people taking another approach rather than supply a 
> function that
> is not particularly portable (or not efficient when ported).

One way to do it would be to do the following for every set of 
selected indices: go through all the two-vector, single-instruction 
operations, check whether any of them does exactly what you need, 
and use it if it does. Otherwise do something that will always 
work, although it may not always be optimal. One option would be 
to use swizzle on both vectors to get each of the elements to 
its final index and then blend the two vectors together. For 
sse 1, 2 and 3 you would need to use xorps to blend them, so I 
guess this is one more place where you would need vector literals.
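
A rough scalar sketch of that scheme, again in Python and only as a 
model of the selection logic, not a proposed std.simd API. Indices 0 
to 3 select from the first vector and 4 to 7 from the second; the 
function names and the pattern table are hypothetical, and lanes are 
modelled as plain integers standing in for 32 bit patterns. The blend 
uses the identity a ^ ((a ^ b) & mask), which is one way to build it 
from xor and and with a constant mask (hence the vector literal).

```python
MASK32 = 0xFFFFFFFF

# Index patterns that a single SSE instruction produces directly.
SINGLE_INSTRUCTION = {
    (0, 4, 1, 5): "unpcklps",
    (2, 6, 3, 7): "unpckhps",
}


def plan(idx):
    """Name the strategy for a two-vector shuffle with these indices."""
    idx = tuple(idx)
    if idx in SINGLE_INSTRUCTION:
        return SINGLE_INSTRUCTION[idx]
    # shufps takes its two low lanes from the first operand and its
    # two high lanes from the second one.
    if idx[0] < 4 and idx[1] < 4 and idx[2] >= 4 and idx[3] >= 4:
        return "shufps"
    return "swizzle + swizzle + blend"


def swizzle(v, idx):
    """One-vector swizzle: result[k] = v[idx[k]]."""
    return [v[i] for i in idx]


def xor_blend(a, b, mask):
    """Per lane: a where the mask lane is 0, b where it is all ones."""
    return [x ^ ((x ^ y) & m) for x, y, m in zip(a, b, mask)]


def shuffle2_fallback(a, b, idx):
    """Generic path: swizzle both inputs into place, then blend."""
    from_b = [i >= 4 for i in idx]
    # Lanes that come from the other vector get a dummy index 0;
    # the blend discards them.
    sa = swizzle(a, [0 if t else i for i, t in zip(idx, from_b)])
    sb = swizzle(b, [i - 4 if t else 0 for i, t in zip(idx, from_b)])
    mask = [MASK32 if t else 0 for t in from_b]  # the constant mask
    return xor_blend(sa, sb, mask)


a, b = [10, 11, 12, 13], [20, 21, 22, 23]
print(plan((4, 0, 5, 1)), shuffle2_fallback(a, b, (4, 0, 5, 1)))
```

The fallback always costs two swizzles plus the blend, which is why 
the single-instruction check comes first.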

Someone who knows which two-vector shuffling operations the 
platform supports could still write optimal platform specific 
(but portable across compilers) code this way, and for others this 
would still be useful to some degree (the documentation should 
mention that it may not be very efficient, though). But I think 
that it would be better to have platform specific APIs for 
platform specific code, as I said earlier in this thread.

>> Unfortunately I can't, at least not a clean one. Using string 
>> mixins would
>> be one way but I think no one wants that kind of API in 
>> Druntime or Phobos.
>
>
> Yeah, absolutely not.
> This is possibly the most compelling motivation behind a 
> __forceinline
> mechanism that I've seen come up... ;)
>
>  I'm already unhappy that
>>> std.simd produces redundant function calls.
>>>
>>> <rant> please  please please can haz __forceinline! </rant>
>>>
>>
>> I agree that we need that.
>>
>
> Huzzah! :)

Walter opposes this, right? I wonder how we could convince him.

There's one more thing that I wanted to ask you. If I were to add 
LDC support to std.simd, should I just add version(LDC) blocks to 
all the functions? Sounds like a lot of duplicated code...