Can I get a more in-depth guide about the inline assembler?
Era Scarecrow via Digitalmars-d-learn
digitalmars-d-learn at puremagic.com
Thu Jun 2 09:37:11 PDT 2016
On Thursday, 2 June 2016 at 13:32:51 UTC, ZILtoid1991 wrote:
> On Thursday, 2 June 2016 at 07:17:23 UTC, Johan Engelen wrote:
>> Could you also paste the D version of your code? Perhaps the
>> compiler (LDC, GDC) will generate similarly vectorized code
>> that is inlinable, etc.
>
> ubyte[4] dest2 = *p;
> dest2[1] = to!ubyte((src[1] * (src[0] + 1) + dest2[1] * (256 -
> src[0]))>>8);
> dest2[2] = to!ubyte((src[2] * (src[0] + 1) + dest2[2] * (256 -
> src[0]))>>8);
> dest2[3] = to!ubyte((src[3] * (src[0] + 1) + dest2[3] * (256 -
> src[0]))>>8);
> *p = dest2;
>
> The main problem with this is that it's much slower, even if I
> would calculate the alpha blending values once. The assembly
> code does not seem to have higher impact than the "replace if
> alpha = 255" algorithm:
>
> if(src[0] == 255){
> *p = src;
> }
>
> It also seems I have quite a few problems with the assembly
> code, mostly with the pmulhuw command (it returns the higher 16
> bits of the result, I need the lower 16 bits as unsigned), also
> with the pointers, as the read outs and write backs don't
> land in their correct places, sometimes resulting in a
> flickering screen or wrong colors affecting neighboring pixels.
> Current assembly code:
I'd say the major portion of your speedup comes from doing 3
things at once. Specifically, you're working with three 8-bit
colors, so you have 24 bits of data to work with, and by giving
each color 8 extra bits for the fixed-point fraction a single
multiply instruction can do 4 small multiplies at once.
You'd probably get a similar effect without MMX by bit shifting
before and after the calculation. Since you're working with 3
colors and the alpha/multiplier, you can pack the colors 16 bits
apart into one 64-bit integer and multiply them all in one go
(this reduces 6 multiplies to a mere 2):
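// Pack the three color channels 16 bits apart. The casts are needed:
// a ubyte promotes to a 32-bit int, which cannot be shifted by 32.
ulong tmp1 = (cast(ulong)src[1] << 32) | (cast(ulong)src[2] << 16) | src[3];
ulong tmp2 = (cast(ulong)dest2[1] << 32) | (cast(ulong)dest2[2] << 16) | dest2[3];
tmp1 *= src[0]+1;
tmp1 += tmp2*(256 - src[0]);
// Each 16-bit field now holds at most 255*257 = 65535, so nothing
// carries into the neighboring field. The result lands in src here;
// store it in dest2 instead if that is what gets written back.
src[3] = cast(ubyte)((tmp1 >> 8) & 0xff);
src[2] = cast(ubyte)((tmp1 >> 24) & 0xff);
src[1] = cast(ubyte)((tmp1 >> 40) & 0xff);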
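
To check that the packed version really matches the per-channel formula from the quoted code, here is a minimal self-contained sketch. The function names `blendPerChannel` and `blendPacked` are made up for illustration, and it assumes the alpha sits in element 0 as in the snippets above.

```d
// Reference: blend each channel on its own (two multiplies per channel).
ubyte[4] blendPerChannel(ubyte[4] src, ubyte[4] dest)
{
    immutable uint a = src[0];            // alpha is element 0
    foreach (i; 1 .. 4)
        dest[i] = cast(ubyte)((src[i] * (a + 1) + dest[i] * (256 - a)) >> 8);
    return dest;                          // dest[0] is left untouched
}

// Packed: three 16-bit fields in one ulong, two multiplies in total.
ubyte[4] blendPacked(ubyte[4] src, ubyte[4] dest)
{
    immutable ulong a = src[0];
    // Cast before shifting so the shift happens on a 64-bit value.
    ulong s = (cast(ulong) src[1] << 32) | (cast(ulong) src[2] << 16) | src[3];
    ulong d = (cast(ulong) dest[1] << 32) | (cast(ulong) dest[2] << 16) | dest[3];
    // Per field: s*(a+1) + d*(256-a) <= 255*257 = 65535, so it fits in 16 bits.
    s = s * (a + 1) + d * (256 - a);
    // Casting to ubyte keeps the low 8 bits, i.e. the top byte of each field.
    dest[3] = cast(ubyte)(s >> 8);
    dest[2] = cast(ubyte)(s >> 24);
    dest[1] = cast(ubyte)(s >> 40);
    return dest;
}

unittest
{
    ubyte[4] src  = [128, 200, 100, 50];
    ubyte[4] dest = [255, 10, 20, 30];
    assert(blendPacked(src, dest) == blendPerChannel(src, dest));
}
```

Saved to a file, this can be tried with something like `dmd -unittest -main -run blend.d`. Unlike the snippet above it writes the result into a copy of the destination pixel, which is just a choice made for the sketch.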
You could also increase the bit precision, so if you decided to
do further adds or some other calculations there would be more
room to fudge with, but not much. Say you gave yourself 20 bits
per value rather than 16: each field can then hold values 16x
higher, enough to sum up x values and get their average at almost
no cost (if x is a power of 2 the divide is just a shift), other
than a little difference in how you write it :)
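
As a rough sketch of the 20-bit idea (my own illustration, with the hypothetical name `blendAverage16`): blend 16 source pixels over the same destination, add up the unshifted products in 20-bit fields, and divide by 16 with the same shift that drops the fraction. It assumes exactly 16 pixels, since 16 * 65535 = 1048560 only just fits below 2^20.

```d
// Blend 16 source pixels over one destination pixel and average the results.
// The three channels sit in 20-bit fields of a single ulong (60 bits used).
ubyte[4] blendAverage16(const ubyte[4][] srcs, ubyte[4] dest)
{
    assert(srcs.length == 16, "this sketch handles exactly 16 source pixels");

    immutable ulong d = (cast(ulong) dest[1] << 40) | (cast(ulong) dest[2] << 20) | dest[3];
    ulong sum = 0;
    foreach (src; srcs)
    {
        immutable ulong a = src[0];       // alpha is element 0
        immutable ulong s = (cast(ulong) src[1] << 40) | (cast(ulong) src[2] << 20) | src[3];
        // Each field gains at most 65535 per pixel, so 16 pixels stay below 2^20.
        sum += s * (a + 1) + d * (256 - a);
    }

    // Shift by 8 (fixed-point fraction) plus 4 (divide by 16) = 12 bits per field.
    dest[3] = cast(ubyte)(sum >> 12);
    dest[2] = cast(ubyte)(sum >> 32);
    dest[1] = cast(ubyte)(sum >> 52);
    return dest;
}
```

Note that this averages before the fraction is thrown away, so it can differ by one from averaging 16 results that were already rounded down.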
Although you might still get a better result from MMX
instructions if you have them in the right order. Don't forget
though that the MMX registers are aliased onto the x87 floating
point registers, so mixing the two is a big no-no unless you run
EMMS after the MMX code and before any floating point code.