SIMD ideas for Rust

Fri Jul 19 09:02:01 PDT 2013

On 19 July 2013 19:33, bearophile <bearophileHUGS at lycos.com> wrote:

> Manu:
>
>> Interesting. Almost all his points are what we do already in D.
>> Always nice to see others come to the same conclusions :)
>>
>
> While trying to write a multiplication of two complex numbers using SSE3
> with LDC2 I found seven or more bugs, which I will discuss
> elsewhere. But regarding the syntax, in nice code like this D requires
> adding ".array" before all those subscripts (code adapted from Fog):
>
>
> double2 complexMult(in double2 a, in double2 b) pure nothrow {
>     double2 b_flip = [b.array[1], b.array[0]];
>     double2 a_im = [a.array[1], a.array[1]];
>     double2 a_re = [a.array[0], a.array[0]];
>     double2 aib = a_im * b_flip;
>     double2 arb = a_re * b;
>     return [arb.array[0] - aib.array[0], arb.array[1] + aib.array[1]];
> }
>
>
> A line like this:
>
> double2 b_flip = [b.array[1], b.array[0]];
>
> becomes something like:
>
> pshufd   $238,  %xmm1, %xmm3
>
> Similarly, all the other lines become single instructions (except the last
> one, because LDC2 fails to use addsubpd).
>
> I vaguely remember you saying that slow SIMD operations shouldn't have
> too short a syntax, to avoid giving an illusion of efficiency. But given
> that the CPU "often" executes such array subscripting and shuffling
> efficiently, isn't it nicer/enough to support a simpler syntax like this in
> D?
>
> double2 complexMult(in double2 a, in double2 b) pure nothrow {
>     double2 b_flip = [b[1], b[0]];
>     double2 a_im = [a[1], a[1]];
>     double2 a_re = [a[0], a[0]];
>     double2 aib = a_im * b_flip;
>     double2 arb = a_re * b;
>     return [arb[0] - aib[0], arb[1] + aib[1]];
> }
>

The point of eliminating the index operator is that indexing implies a
vector->float cast.
You want to perform a shuffle (swizzle), but your code only performs that
operation incidentally.
What it really does is cast a bunch of vector components to floats and then
rebuild a vector from them, and LLVM can helpfully deal with that.

I would suggest calling a spade a spade and using a swizzle function to
perform a swizzle, instead of code like what you wrote.
Wouldn't this be better:

double2 complexMult(in double2 a, in double2 b) pure nothrow {
    // or b.swizzle!"yx", if we don't want to include an opDispatch in the
    // basic type
    double2 b_flip = b.yx;
    double2 a_im = a.yy;
    double2 a_re = a.xx;
    double2 aib = a_im * b_flip;
    double2 arb = a_re * b;

    // This final line is tricky... it's not very portable:
    //    return [arb[0] - aib[0], arb[1] + aib[1]];

    // Maybe:
    return select([-1, 0], arb-aib, arb+aib);
    // Hopefully the x86 optimiser will generate the proper opcode. Or a
    // bunch of other options: a multi-vector shuffle, shift, swizzle, interleave.
}

I think that would be better. It's more portable, and it eliminates the code
that implies a vector->float->vector cast sequence, which, I maintain,
should be syntactically discouraged at all costs.
You don't want to give people the idea that it's reasonable code to
write ;)
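
To check the lane arithmetic of the swizzle version, here is a minimal scalar
sketch. It models the vector as a plain double[2], so it says nothing about
the code a compiler would generate. Vec2, swizzle and select are hypothetical
stand-ins written for this example, not core.simd or std.simd APIs, and the
mask convention (a non-zero lane takes the first operand) is an assumption.

```d
// Scalar model of the swizzle-based complexMult. Vec2, swizzle and select
// are hypothetical helpers for illustration only, not library APIs.
alias Vec2 = double[2]; // lane 0 = real part (x), lane 1 = imaginary part (y)

// Pick lanes by name at compile time, e.g. swizzle!"yx"(v) is [v[1], v[0]].
Vec2 swizzle(string mask)(in Vec2 v) pure nothrow @safe
    if (mask.length == 2)
{
    static assert((mask[0] == 'x' || mask[0] == 'y') &&
                  (mask[1] == 'x' || mask[1] == 'y'),
                  "mask may only name lanes x and y");
    enum size_t i0 = mask[0] == 'x' ? 0 : 1;
    enum size_t i1 = mask[1] == 'x' ? 0 : 1;
    Vec2 r;
    r[0] = v[i0];
    r[1] = v[i1];
    return r;
}

// Per-lane select (assumed convention): a non-zero mask lane takes ifSet,
// a zero mask lane takes ifClear.
Vec2 select(in int[2] mask, in Vec2 ifSet, in Vec2 ifClear) pure nothrow @safe
{
    Vec2 r;
    foreach (i; 0 .. 2)
        r[i] = mask[i] != 0 ? ifSet[i] : ifClear[i];
    return r;
}

Vec2 complexMult(in Vec2 a, in Vec2 b) pure nothrow @safe
{
    const bFlip = swizzle!"yx"(b); // [b.im, b.re]
    const aIm   = swizzle!"yy"(a); // [a.im, a.im]
    const aRe   = swizzle!"xx"(a); // [a.re, a.re]
    Vec2 aib, arb, diff, sum;
    foreach (i; 0 .. 2)
    {
        aib[i]  = aIm[i] * bFlip[i]; // [a.im*b.im, a.im*b.re]
        arb[i]  = aRe[i] * b[i];     // [a.re*b.re, a.re*b.im]
        diff[i] = arb[i] - aib[i];
        sum[i]  = arb[i] + aib[i];
    }
    // Lane 0 takes the difference (real part), lane 1 the sum (imaginary part).
    return select([-1, 0], diff, sum);
}

unittest
{
    Vec2 a = [1.0, 2.0]; // 1 + 2i
    Vec2 b = [3.0, 4.0]; // 3 + 4i
    assert(complexMult(a, b) == [-5.0, 10.0]); // (1 + 2i)(3 + 4i) = -5 + 10i
}
```

Every step here is a whole-vector operation (swizzle, multiply, add/subtract,
select), so no line reads as a vector->float->vector round trip.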