SIMD ideas for Rust

Manu turkeyman at gmail.com
Fri Jul 19 19:12:05 PDT 2013


On 20 July 2013 03:43, bearophile <bearophileHUGS at lycos.com> wrote:

> Manu:
>
>> What you're really doing is casting a bunch of vector components to
>> floats, and then rebuilding a vector, and LLVM can helpfully deal with
>> that.
>>
>> I would suggest calling a spade a spade and using a swizzle function to
>> perform a swizzle, instead of code like what you wrote.
>> Wouldn't this be better:
>>
>> double2 complexMult(in double2 a, in double2 b) pure nothrow {
>>     // or b.swizzle!"yx", if we don't want to include an opDispatch in
>>     // the basic type
>>     double2 b_flip = b.yx;
>>     double2 a_im = a.yy;
>>     double2 a_re = a.xx;
>>     double2 aib = a_im * b_flip;
>>     double2 arb = a_re * b;
>>
>
> I see and you are right.
>
> (If I turn the basic type into a struct containing a double2
> aliased-this to the whole structure, the generated code becomes
> awful).
>
> A YMM register already contains 8 floats, and SIMD registers will
> probably keep growing, maybe to become 1024 bits long. So the swizzle
> item names like x y z w will not suffice and some more general
> naming scheme is needed.


Swizzling bytes already has that problem. Hexadecimal swizzle strings work
nicely up to 16 elements, but past that I'd probably require that the
template receive a tuple of ints.
These are trivial details. .xyzw are particularly useful for 2-4d vectors.
They can be removed for anything higher. The nicest/most preferred
interface can be decided with experience.
As yet there's not a lot of practical experience with >128-bit registers
or with the sorts of patterns that appear frequently.
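
A minimal sketch of the tuple-of-ints idea, assuming plain static arrays
stand in for vector registers. The swizzle template below is hypothetical
(it is not an existing core.simd or std.simd function) and only shows the
shape of the interface, not the code a real SIMD backend would emit:

```d
// Hypothetical helper: the indices are compile-time template arguments.
template swizzle(idx...)
{
    T[idx.length] swizzle(T, size_t N)(T[N] v)
    {
        T[idx.length] r;
        // foreach over a compile-time sequence is unrolled by the compiler
        foreach (i, j; idx)
        {
            static assert(j >= 0 && j < N, "swizzle index out of range");
            r[i] = v[j];
        }
        return r;
    }
}

void main()
{
    // 8 lanes, as in a YMM register holding floats
    float[8] v = [0, 1, 2, 3, 4, 5, 6, 7];
    auto reversed = swizzle!(7, 6, 5, 4, 3, 2, 1, 0)(v);
    assert(reversed == [7f, 6, 5, 4, 3, 2, 1, 0]);

    // The 2-element "yx" case from the quoted code
    double[2] b = [1.0, 2.0];
    assert(swizzle!(1, 0)(b) == [2.0, 1.0]);
}
```

Because the indices are template arguments, an out-of-range index is
rejected at compile time, and the same call syntax scales past the 16
elements that a hexadecimal string can address.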

>>     // return [arb[0] - aib[0], arb[1] + aib[1]];
>>     // This final line is tricky... it's not very portable.
>>
>>     // Maybe:
>>     return select([-1, 0], arb-aib, arb+aib);
>>     // Hopefully the x86 optimiser will generate the proper opcode. Or a
>>     // bunch of other options: a multi-vector shuffle, shift, swizzle,
>>     // interleave.
>> }
>>
>> I think that would be better. More portable, and it eliminates the code
>> that implies a vector->float->vector cast sequence, which, I maintain,
>> should be syntactically discouraged at all costs.
>> You don't want to be giving people the bad idea that it's reasonable
>> code to write ;)
>>
>
> My experience in writing such kind of code is limited. I will try
> your select to see what kind of code LDC2-LLVM generates.
>

It probably won't be good, because I haven't paid attention to how it
optimises on SSE yet.
You need to encourage the compiler to generate ADDSUBPD for SSE, and any
(or none) of the possible expressions may result in it choosing the proper
opcode.
I'm reluctant to add a helper function for that operation, since it's
dreadfully SSE-specific. It's the sort of thing where you would rather
make sure the standard API reliably encourages the optimiser to do it.
If you can find a pattern of operations that optimises to ADDSUBPD, I'm
interested to know what the sequence(s) are.
If not, we'll consider an explicit function. It can be emulated within
reason on other architectures, but I think it would be better to work out a
different solution, i.e. perform 2 (or 4) operations side by side (stream
processing). That will work well on all architectures.
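
A rough sketch of the side-by-side approach, assuming a structure-of-arrays
layout in which the real parts of two complex numbers share one array and
the imaginary parts share another. Complex2 and complexMult2 are
hypothetical names, and plain double[2] arrays stand in for double2
vectors. With this layout each lane needs only whole-vector multiplies,
one subtract and one add, so no ADDSUBPD-style mixed operation is needed:

```d
// Hypothetical layout: lane i holds the complex number re[i] + im[i]*i.
struct Complex2
{
    double[2] re;
    double[2] im;
}

// Multiplies two pairs of complex numbers lane by lane.
Complex2 complexMult2(Complex2 a, Complex2 b)
{
    Complex2 r;
    foreach (i; 0 .. 2)
    {
        r.re[i] = a.re[i] * b.re[i] - a.im[i] * b.im[i];
        r.im[i] = a.re[i] * b.im[i] + a.im[i] * b.re[i];
    }
    return r;
}

void main()
{
    // Lane 0: (1 + 2i) * (3 + 4i) = -5 + 10i
    // Lane 1: (0 + 1i) * (0 + 1i) = -1 + 0i
    auto a = Complex2([1.0, 0.0], [2.0, 1.0]);
    auto b = Complex2([3.0, 0.0], [4.0, 1.0]);
    auto r = complexMult2(a, b);
    assert(r.re == [-5.0, -1.0]);
    assert(r.im == [10.0, 0.0]);
}
```

The cost is that the data has to be stored (or shuffled) into separate
real and imaginary streams up front.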