<div dir="ltr">On 20 July 2013 03:43, bearophile <span dir="ltr"><<a href="mailto:bearophileHUGS@lycos.com" target="_blank">bearophileHUGS@lycos.com</a>></span> wrote:<br><div class="gmail_extra"><div class="gmail_quote">

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Manu:<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">

What you're really doing is casting a bunch of vector components to floats,<br>

and then rebuilding a vector, and LLVM can helpfully deal with that.<br>

<br>

I would suggest calling a spade a spade and using a swizzle function to<br>

perform a swizzle, instead of code like what you wrote.<br>

Wouldn't this be better:<br>

<br></div><div class="im">

double2 complexMult(in double2 a, in double2 b) pure nothrow {<br></div><div class="im">

    double2 b_flip = b.yx; // or b.swizzle!"yx", if we don't want to include an opDispatch in the basic type<br>

    double2 a_im = a.yy;<br>

    double2 a_re = a.xx;<br></div><div class="im">

    double2 aib = a_im * b_flip;<br>

    double2 arb = a_re * b;<br>

</div></blockquote>

<br>

I see and you are right.<br>

<br>

(If I turn the basic type into a struct containing a double2<br>

aliased-this to the whole structure, the generated code becomes<br>

awful).<br>

<br>

A YMM register already contains 8 floats, and probably SIMD registers<br>

will keep growing, maybe to become 1024 bits long. So the swizzle<br>

item names like x y z w will not suffice and some more general<br>

naming scheme is needed.</blockquote><div><br></div><div style>Swizzling bytes already has that problem. Hexadecimal swizzle strings work nicely up to 16 elements, but past that I'd probably require the template to receive a tuple of ints.</div>

<div style>These are minor details. The .xyzw names are particularly useful for 2-4D vectors and can be dropped for anything wider. The preferred interface can be decided with experience.</div><div style>As yet there's not much practical experience with registers wider than 128 bits, or with the sorts of patterns that appear frequently when using them.</div>
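<div style>To make the two interfaces concrete, here is a minimal sketch, assuming plain static arrays as a stand-in for SIMD vector types. The names swizzleHex, swizzle and hexValue are hypothetical and are not taken from any existing library; a real implementation would map onto shuffle instructions rather than scalar copies.</div>
<pre>
```d
import std.stdio : writeln;

/// Hypothetical helper: value of one lowercase hex digit.
int hexValue(char c)
{
    foreach (i, h; "0123456789abcdef")
        if (h == c)
            return cast(int) i;
    assert(0, "mask must contain only lowercase hex digits");
}

/// Hex-string form: one digit per output lane, so it can address
/// at most 16 source lanes.
T[mask.length] swizzleHex(string mask, T, size_t N)(const T[N] v)
{
    T[mask.length] r;
    foreach (k, c; mask)
        r[k] = v[hexValue(c)];
    return r;
}

/// Tuple-of-ints form: no limit on the number of source lanes.
template swizzle(Idx...)
{
    T[Idx.length] swizzle(T, size_t N)(const T[N] v)
    {
        T[Idx.length] r;
        foreach (k, i; Idx) // unrolled over the index tuple
            r[k] = v[i];
        return r;
    }
}

void main()
{
    double[4] v = [10.0, 20.0, 30.0, 40.0];
    auto reversed = swizzleHex!"3210"(v);    // lanes in reverse order
    auto doubled  = swizzle!(1, 1, 0, 0)(v); // duplicated lanes
    writeln(reversed, " ", doubled);
}
```
</pre>
<div style>The hex form stays compact for byte swizzles on 128-bit registers, while the tuple form is the one that would scale to wider registers.</div>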

<div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

//    return [arb[0] - aib[0], arb[1] + aib[1]]; // this final line is tricky... it's not very portable.<br>

<br>

    // Maybe:<br>

    return select([-1, 0], arb-aib, arb+aib);<br>

    // Hopefully the x86 optimiser will generate the proper opcode. Or a bunch of other options; a multi-vector shuffle, shift, swizzle, interleave.<br>

}<br>

<br>

I think that would be better. More portable, and it eliminates the code<br>

that implies a vector->float->vector cast sequence, which I maintain,<br>

should be syntactically discouraged at all costs.<br>

You don't want to be giving people bad ideas that it's reasonable code to<br>

write ;)<br>

</blockquote>

<br></div>

My experience in writing this kind of code is limited. I will try<br>

your select to see what kind of code LDC2-LLVM generates.<br></blockquote><div><br></div><div style>It probably won't be good, because I haven't yet looked at how it optimises on SSE.</div><div style>You need to encourage the compiler to generate ADDSUBPD for SSE, and any (or none) of the possible expressions may result in it choosing the proper opcode.</div>

<div style>I'm reluctant to add a helper function for that operation, since it's dreadfully SSE-specific. It's the sort of thing where I'd rather make sure the standard API reliably encourages the optimiser to do it.</div>

<div style>If you can find a pattern of operations that optimises to ADDSUBPD, I'm interested to know what the sequence(s) are.</div><div style>If not, we'll consider an explicit function. It can be emulated within reason on other architectures, but I think it would be better to work out a different solution, i.e. perform 2 (or 4) complex multiplies side by side (stream processing). That will work well on all architectures.</div>
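<div style>As a rough illustration of the two approaches, here is a sketch that assumes double[2] as a stand-in for a 2-lane vector. The names addsub and complexMult2 are hypothetical; addsub only emulates what ADDSUBPD computes (subtract in lane 0, add in lane 1) and says nothing about which opcode a compiler will actually pick.</div>
<pre>
```d
import std.stdio : writeln;

alias double2 = double[2]; // stand-in for a 2-lane SIMD vector

/// Scalar emulation of ADDSUBPD: lane 0 subtracts, lane 1 adds.
double2 addsub(const double2 a, const double2 b)
{
    double2 r;
    r[0] = a[0] - b[0];
    r[1] = a[1] + b[1];
    return r;
}

/// One complex multiply, each vector holding (re, im).
double2 complexMult(const double2 a, const double2 b)
{
    double2 bFlip = [b[1], b[0]]; // b.yx
    double2 aIm = [a[1], a[1]];   // a.yy
    double2 aRe = [a[0], a[0]];   // a.xx
    double2 aib, arb;
    aib[] = aIm[] * bFlip[];
    arb[] = aRe[] * b[];
    return addsub(arb, aib);
}

/// Two complex multiplies side by side: real and imaginary parts live in
/// separate vectors, so no shuffle or add/sub mix is needed.
void complexMult2(const double2 aRe, const double2 aIm,
                  const double2 bRe, const double2 bIm,
                  out double2 re, out double2 im)
{
    re[] = aRe[] * bRe[] - aIm[] * bIm[];
    im[] = aRe[] * bIm[] + aIm[] * bRe[];
}

void main()
{
    double2 a = [1.0, 2.0]; // 1 + 2i
    double2 b = [3.0, 4.0]; // 3 + 4i
    writeln(complexMult(a, b));

    // Lane 0: (1 + 2i)(3 + 4i), lane 1: (0 + 1i)(0 + 1i)
    double2 aRe = [1.0, 0.0], aIm = [2.0, 1.0];
    double2 bRe = [3.0, 0.0], bIm = [4.0, 1.0];
    double2 re, im;
    complexMult2(aRe, aIm, bRe, bIm, re, im);
    writeln(re, " ", im);
}
```
</pre>
<div style>The side-by-side form uses only plain multiplies, adds and subtracts, which is why it should map well onto any architecture with vector arithmetic.</div>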

</div></div></div>