Can I get a more in-depth guide about the inline assembler?
ZILtoid1991 via Digitalmars-d-learn
digitalmars-d-learn at puremagic.com
Thu Jun 2 06:32:51 PDT 2016
On Thursday, 2 June 2016 at 07:17:23 UTC, Johan Engelen wrote:
> On Wednesday, 1 June 2016 at 23:23:49 UTC, ZILtoid1991 wrote:
>> Here's the assembly code for my alpha-blending routine:
>
> Could you also paste the D version of your code? Perhaps the
> compiler (LDC, GDC) will generate similarly vectorized code
> that is inlinable, etc.
>
> -Johan
ubyte[4] dest2 = *p;
dest2[1] = to!ubyte((src[1] * (src[0] + 1) + dest2[1] * (256 - src[0]))>>8);
dest2[2] = to!ubyte((src[2] * (src[0] + 1) + dest2[2] * (256 - src[0]))>>8);
dest2[3] = to!ubyte((src[3] * (src[0] + 1) + dest2[3] * (256 - src[0]))>>8);
*p = dest2;
The main problem with this is that it's much slower, even if I
calculate the alpha blending values only once. The assembly code
doesn't seem to cost noticeably more than the "replace if alpha =
255" algorithm:
if(src[0] == 255){
*p = src;
}
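A minimal sketch of how the D version could compute the two alpha factors once per pixel and skip the range check that to!ubyte performs on every call. The function name blendPixel and its signature are made up for illustration; the channel order [alpha, c1, c2, c3] is taken from the snippets above. The plain cast is safe only because src * (alpha + 1) + dest * (256 - alpha) never exceeds 255 * 257 = 65535, so the shifted result always fits in a ubyte. Whether this closes the gap to the assembly version is untested.

```d
/// Hypothetical helper: blends src over dest, channel 0 being alpha.
void blendPixel(ref ubyte[4] dest, const ubyte[4] src) pure nothrow @nogc @safe
{
    // Fast path from the post: a fully opaque source replaces the destination.
    if (src[0] == 255)
    {
        dest = src;
        return;
    }

    // Compute both factors once instead of once per channel.
    const uint a1   = src[0] + 1;    // 1 + alpha
    const uint a256 = 256 - src[0];  // 256 - alpha

    foreach (i; 1 .. 4)
    {
        // The sum is at most 65535, so (sum >> 8) <= 255 and the cast cannot truncate.
        dest[i] = cast(ubyte)((src[i] * a1 + dest[i] * a256) >> 8);
    }
}
```

With a pointer p as in the code above, the call would look like blendPixel(*p, src).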
It also seems I have quite a few problems with the assembly code,
mostly with the pmulhuw instruction (it returns the upper 16 bits of
the result, but I need the lower 16 bits as unsigned), and also with
the pointers: the reads and write-backs don't land in their
correct places, sometimes resulting in a flickering screen or in
wrong colors affecting neighboring pixels. Current assembly code:
//ushort[4] alpha = [src[0],src[0],src[0],src[0]]; //replace it if there's a faster method for this
ushort[4] alpha = [100,100,100,100];
//src[3] = 255;
ubyte[4] *p2 = cast(ubyte[4]*)src2.ptr;
ushort[4] *p3 = cast(ushort[4]*)alpha.ptr;
ushort[4] *pc_1 = cast(ushort[4]*)alphaMMXmul_const1.ptr;
ushort[4] *pc_256 = cast(ushort[4]*)alphaMMXmul_const256.ptr;
asm{
//moving the values to their destinations
mov ESI, p2[EBP];
mov EDI, p[EBP];
movd MM0, [ESI];
movd MM1, [EDI];
mov ESI, p3[EBP];
movq MM5, [ESI];
mov ESI, pc_256[EBP];
movq MM7, [ESI];
mov ESI, pc_1[EBP];
movq MM6, [ESI];
punpcklbw MM2, MM0;
punpcklbw MM3, MM1;
paddw MM6, MM5; //1 + alpha
psubw MM7, MM5; //256 - alpha
//psllw MM2, 2;
//psllw MM3, 2;
psrlw MM6, 1;
psrlw MM7, 1;
pmullw MM2, MM6; //src * (1 + alpha)
pmullw MM3, MM7; //dest * (256 - alpha)
paddw MM3, MM2; //(src * (1 + alpha)) + (dest * (256 - alpha))
psrlw MM3, 8; //((src * (1 + alpha)) + (dest * (256 - alpha))) / 256
//moving the result to its place;
packuswb MM4, MM3;
movd [EDI-3], MM4;
emms;
}
I tried to get the correct result by trial and error, but there's
no real improvement.
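To make the arithmetic easier to check by hand, here is a scalar sketch in plain D of what single 16-bit lanes of the instructions used above do. It is a model for reasoning only, not a fix for the routine, and all function names are made up for illustration. Two things it shows: pmullw already keeps the lower 16 bits (pmulhuw keeps the upper 16), and punpcklbw interleaves the bytes of both operands, so the destination register's old bytes end up in the low half of each word and the source bytes in the high half. Zero-extending a byte this way would need the pixel in the destination operand and an all-zero register as the source operand.

```d
/// pmullw, one lane: lower 16 bits of the 16x16-bit product.
ushort mulLow(ushort a, ushort b) pure nothrow @nogc @safe
{
    return cast(ushort)((cast(uint) a * b) & 0xFFFF);
}

/// pmulhuw, one lane: upper 16 bits of the unsigned 16x16-bit product.
ushort mulHighUnsigned(ushort a, ushort b) pure nothrow @nogc @safe
{
    return cast(ushort)((cast(uint) a * b) >> 16);
}

/// punpcklbw, one output word: destination byte low, source byte high.
ushort unpackLow(ubyte destByte, ubyte srcByte) pure nothrow @nogc @safe
{
    return cast(ushort)((srcByte << 8) | destByte);
}

/// The intended per-channel blend, using only 16-bit lane operations.
ubyte blendLane(ubyte src, ubyte dest, ubyte alpha) pure nothrow @nogc @safe
{
    // Each product is at most 255 * 256 = 65280, so the low word is the full result.
    const ushort s = mulLow(src, cast(ushort)(alpha + 1));     // src * (1 + alpha)
    const ushort d = mulLow(dest, cast(ushort)(256 - alpha));  // dest * (256 - alpha)
    const ushort sum = cast(ushort)(s + d);  // paddw wraps modulo 65536; the sum is at most 65535
    return cast(ubyte)(sum >> 8);            // psrlw by 8
}
```

Since none of the intermediate values exceed 16 bits here, the formula should not need the upper-word multiply or a pre-shift of the factors, assuming each channel really is zero-extended to a word before multiplying.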