Can I get a more in-depth guide about the inline assembler?

Thu Jun 2 06:32:51 PDT 2016

On Thursday, 2 June 2016 at 07:17:23 UTC, Johan Engelen wrote:
> On Wednesday, 1 June 2016 at 23:23:49 UTC, ZILtoid1991 wrote:
>> Here's the assembly code for my alpha-blending routine:
>
> Could you also paste the D version of your code? Perhaps the 
> compiler (LDC, GDC) will generate similarly vectorized code 
> that is inlinable, etc.
>
> -Johan

ubyte[4] dest2 = *p;
dest2[1] = to!ubyte((src[1] * (src[0] + 1) + dest2[1] * (256 - src[0]))>>8);
dest2[2] = to!ubyte((src[2] * (src[0] + 1) + dest2[2] * (256 - src[0]))>>8);
dest2[3] = to!ubyte((src[3] * (src[0] + 1) + dest2[3] * (256 - src[0]))>>8);
*p = dest2;
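
For anyone who wants to compile the D version on its own, here is a 
minimal sketch of the same formula as a function. It assumes that 
index 0 holds alpha, as in the lines above; the name blendPixel is 
made up for illustration, and both weights are computed only once 
per pixel.

```d
import std.conv : to;

/// Blends one pixel. Index 0 is assumed to be alpha (an assumption
/// taken from the snippet above); blendPixel is a hypothetical name.
ubyte[4] blendPixel(ubyte[4] src, ubyte[4] dest)
{
    // ubyte operands promote to int in D, so nothing overflows here.
    immutable int srcWeight = src[0] + 1;     // 1 .. 256
    immutable int destWeight = 256 - src[0];  // 1 .. 256

    ubyte[4] result = dest;                   // index 0 stays as in dest
    foreach (i; 1 .. 4)
    {
        // The weights always sum to 257, so the largest possible value
        // is (255 * 257) >> 8 = 255 and to!ubyte never throws.
        result[i] = to!ubyte((src[i] * srcWeight + dest[i] * destWeight) >> 8);
    }
    return result;
}
```

With alpha = 255 the result is the source colour, and with alpha = 0 
it is the destination colour.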

The main problem with the D version is that it's much slower, even 
if I calculate the alpha-blending weights only once. The assembly 
code does not seem to have a higher performance impact than the 
"replace if alpha = 255" algorithm:

if(src[0] == 255){
*p = src;
}

I also seem to have quite a few problems with the assembly code, 
mostly with the pmulhuw instruction (it returns the high 16 bits of 
the result, while I need the low 16 bits as unsigned), and also with 
the pointers: the reads and write-backs don't land in their correct 
places, sometimes resulting in a flickering screen or wrong colors 
affecting neighboring pixels. Current assembly code:

//ushort[4] alpha = [src[0],src[0],src[0],src[0]];	//replace it if there's a faster method for this
ushort[4] alpha = [100,100,100,100];
//src[3] = 255;
ubyte[4] *p2 = cast(ubyte[4]*)src2.ptr;
ushort[4] *p3 = cast(ushort[4]*)alpha.ptr;
ushort[4] *pc_1 = cast(ushort[4]*)alphaMMXmul_const1.ptr;
ushort[4] *pc_256 = cast(ushort[4]*)alphaMMXmul_const256.ptr;
asm{
//moving the values to their destinations
mov		ESI, p2[EBP];
mov		EDI, p[EBP];
movd	MM0, [ESI];
movd	MM1, [EDI];
mov		ESI, p3[EBP];
movq	MM5, [ESI];
mov		ESI, pc_256[EBP];
movq	MM7, [ESI];
mov		ESI, pc_1[EBP];
movq	MM6, [ESI];
punpcklbw	MM2, MM0;
punpcklbw	MM3, MM1;

paddw	MM6, MM5;	//1 + alpha
psubw	MM7, MM5;	//256 - alpha

//psllw	MM2, 2;
//psllw	MM3, 2;
psrlw	MM6, 1;
psrlw	MM7, 1;
pmullw	MM2, MM6;	//src * (1 + alpha)
pmullw	MM3, MM7;	//dest * (256 - alpha)
paddw	MM3, MM2;	//(src * (1 + alpha)) + (dest * (256 - alpha))
psrlw	MM3, 8;		//((src * (1 + alpha)) + (dest * (256 - alpha))) / 256
//moving the result to its place;
packuswb	MM4, MM3;
movd	[EDI-3], MM4;

emms;
}

I tried to get the correct result by trial and error, but there has 
been no real improvement.
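
To make the two problem areas easier to check without trial and 
error, here is a minimal scalar sketch of what a single 16-bit lane 
should contain after the multiply and after punpcklbw. It models the 
documented behaviour of the instructions as I understand it; the 
function names are hypothetical and not part of any library.

```d
// Scalar models of single MMX lanes, meant only for comparing against
// the values seen in a debugger. All names here are illustrative.

/// Low 16 bits of the 32-bit product: what pmullw keeps per lane.
ushort mulLow16(ushort a, ushort b)
{
    return cast(ushort)((cast(uint) a * b) & 0xFFFF);
}

/// High 16 bits of the unsigned 32-bit product: what pmulhuw keeps per lane.
ushort mulHigh16(ushort a, ushort b)
{
    return cast(ushort)((cast(uint) a * b) >> 16);
}

/// Model of `punpcklbw dst, src`, restricted to the low four bytes.
/// Each 16-bit result lane is (src byte << 8) | dst byte.
ushort[4] unpackLowBytes(ubyte[4] dst, ubyte[4] src)
{
    ushort[4] lanes;
    foreach (i; 0 .. 4)
        lanes[i] = cast(ushort)((src[i] << 8) | dst[i]);
    return lanes;
}
```

Since 255 * 256 = 65280 still fits in 16 bits, the low half is the 
whole product for 8-bit channel values and weights up to 256, while 
the high half is always zero. If this model is right, 
punpcklbw MM2, MM0 puts the bytes of MM0 into the high byte of each 
lane, so zero-extending a pixel would need the pixel in the 
destination operand and a zeroed register as the source.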