Can I get a more in-depth guide about the inline assembler?

Thu Jun 2 09:37:11 PDT 2016

On Thursday, 2 June 2016 at 13:32:51 UTC, ZILtoid1991 wrote:
> On Thursday, 2 June 2016 at 07:17:23 UTC, Johan Engelen wrote:
>> Could you also paste the D version of your code? Perhaps the 
>> compiler (LDC, GDC) will generate similarly vectorized code 
>> that is inlinable, etc.
>
> ubyte[4] dest2 = *p;
> dest2[1] = to!ubyte((src[1] * (src[0] + 1) + dest2[1] * (256 - 
> src[0]))>>8);
> dest2[2] = to!ubyte((src[2] * (src[0] + 1) + dest2[2] * (256 - 
> src[0]))>>8);
> dest2[3] = to!ubyte((src[3] * (src[0] + 1) + dest2[3] * (256 - 
> src[0]))>>8);
> *p = dest2;
>
> The main problem with this is that it's much slower, even if I 
> would calculate the alpha blending values once. The assembly 
> code does not seem to have higher impact than the "replace if 
> alpha = 255" algorithm:
>
> if(src[0] == 255){
> *p = src;
> }
>
> It also seems I have a quite few problems with the assembly 
> code, mostly with the pmulhuw command (it returns the higher 16 
> bit of the result, I need the lower 16 bit as unsigned), also 
> with the pointers, as the read outs and write backs doesn't 
> land to their correct places, sometimes resulting in a 
> flickering screen or wrong colors affecting neighboring pixels. 
> Current assembly code:

  I'd say the major portion of your speedup comes from doing 3 
things at once. Specifically, because you're working with 3 8-bit 
colors, you have 24 bits of data to work with, and by giving each 
color 8 extra bits of headroom for the fixed-point math you can 
pack them side by side and do 4 small multiplies in a single 
instruction.

  You'd probably get a similar effect without MMX by bit shifting 
the 3 colors into one 64-bit integer before the math and shifting 
the results back out afterwards. Each multiply by the 
alpha/multiplier then hits all 3 colors at once, which reduces 6 
multiplies to a mere 2:

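// cast first: src[1] on its own promotes to int, which cannot be shifted by 32
ulong tmp1 = (cast(ulong)src[1] << 32) | (cast(ulong)src[2] << 16) | src[3];
ulong tmp2 = (cast(ulong)dest2[1] << 32) | (cast(ulong)dest2[2] << 16) | dest2[3];

tmp1 *= src[0]+1;
tmp1 += tmp2*(256 - src[0]);

// each 16-bit lane now holds one blended color scaled by 256;
// write back into dest2, as in the quoted code
dest2[3] = (tmp1 >> 8) & 0xff;
dest2[2] = (tmp1 >> 24) & 0xff;
dest2[1] = (tmp1 >> 40) & 0xff;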
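
  To check the packed math against the per-channel formula from 
the quoted code, something like the following could be used. It 
is only a sketch: the function names blendScalar and blendPacked 
are made up for this example, and it assumes the same layout as 
above (index 0 is alpha, 1 to 3 are the colors).

```d
/// Per-channel reference: the formula from the quoted D code.
ubyte[4] blendScalar(ubyte[4] src, ubyte[4] dest)
{
    immutable uint a = src[0];
    foreach (i; 1 .. 4)
        dest[i] = cast(ubyte)((src[i] * (a + 1) + dest[i] * (256 - a)) >> 8);
    return dest;
}

/// Packed version: three 16-bit lanes in one ulong, two multiplies in total.
ubyte[4] blendPacked(ubyte[4] src, ubyte[4] dest)
{
    immutable ulong a = src[0];
    // Widen to ulong before shifting so the upper lanes are not lost.
    ulong s = (cast(ulong) src[1] << 32) | (cast(ulong) src[2] << 16) | src[3];
    ulong d = (cast(ulong) dest[1] << 32) | (cast(ulong) dest[2] << 16) | dest[3];

    // Per lane: color * (a + 1) + color * (256 - a) <= 255 * 257 = 65535,
    // so no lane ever carries into its neighbour.
    s = s * (a + 1) + d * (256 - a);

    // Drop the low 8 fraction bits of each lane; the cast keeps 8 bits.
    dest[3] = cast(ubyte)(s >> 8);
    dest[2] = cast(ubyte)(s >> 24);
    dest[1] = cast(ubyte)(s >> 40);
    return dest;
}

unittest
{
    ubyte[4] src  = [128, 255, 0, 64];
    ubyte[4] dest = [0, 0, 255, 64];
    // Both versions should agree for every alpha value.
    foreach (a; 0 .. 256)
    {
        src[0] = cast(ubyte) a;
        assert(blendPacked(src, dest) == blendScalar(src, dest));
    }
}
```

  Saved as, say, blend.d, it can be run with 
dmd -unittest -main -run blend.d. Whether it actually beats the 
MMX version would have to be measured.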

  You could also widen the lanes so that further adds or other 
calculations have more room to fudge with, but not much. Say you 
gave yourself 20 bits per value rather than 16: each lane can then 
hold values 16x larger, which is enough for getting say the 
average of 16 results at no cost (as long as the count is a power 
of 2, so the divide is just a shift) other than a little 
difference in how you write it :)
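
  As a rough illustration of the wider lanes, here is a sketch 
that blends 16 source pixels over one destination pixel and 
averages the results in 20-bit lanes. The function name 
blendAverage16, the fixed count of 16 and the idea of averaging 
against a single destination are all assumptions made up for the 
example, not something from your code.

```d
enum laneBits = 20; // 16 bits for one blended value + 4 bits for summing 16 of them

/// Blend each of 16 source pixels over `dest` and return the average result.
/// Index 0 is alpha, 1 to 3 are the colors (same layout as the code above).
ubyte[4] blendAverage16(const ubyte[4][] srcs, ubyte[4] dest)
{
    assert(srcs.length == 16, "sketch is written for exactly 16 samples");

    immutable ulong d = (cast(ulong) dest[1] << (2 * laneBits))
                      | (cast(ulong) dest[2] << laneBits)
                      | dest[3];
    ulong acc = 0;
    foreach (s; srcs)
    {
        immutable ulong a = s[0];
        immutable ulong lanes = (cast(ulong) s[1] << (2 * laneBits))
                              | (cast(ulong) s[2] << laneBits)
                              | s[3];
        // Each lane gains at most 65535 per sample; 16 * 65535 < 2^20,
        // so the running sum stays inside its lane.
        acc += lanes * (a + 1) + d * (256 - a);
    }

    // >> 8 removes the alpha scale, >> 4 divides by the 16 samples.
    enum shift = 8 + 4;
    dest[3] = cast(ubyte)(acc >> shift);
    dest[2] = cast(ubyte)(acc >> (laneBits + shift));
    dest[1] = cast(ubyte)(acc >> (2 * laneBits + shift));
    return dest;
}

unittest
{
    ubyte[4][16] srcs;
    foreach (i, ref s; srcs)
    {
        s[0] = cast(ubyte)(i * 17);
        s[1] = cast(ubyte)(255 - i);
        s[2] = cast(ubyte)(i * 3);
        s[3] = 200;
    }
    ubyte[4] dest = [0, 10, 128, 250];

    // Per-channel reference using plain integer math.
    ubyte[4] expected = dest;
    foreach (c; 1 .. 4)
    {
        uint sum = 0;
        foreach (s; srcs)
            sum += s[c] * (s[0] + 1) + dest[c] * (256 - s[0]);
        expected[c] = cast(ubyte)(sum >> 12);
    }
    assert(blendAverage16(srcs[], dest) == expected);
}
```

  Note that the sum is taken over the 16-bit intermediate values 
and only shifted down once at the end, so it can differ by a 
count or so from averaging the already-rounded 8-bit results.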

  Although you might still get a better result from MMX 
instructions if you have them in the right order. Don't forget 
though that MMX shares its registers with the x87 floating point 
unit, so mixing the two without an emms in between is a big no-no.