Optimization tips for alpha blending / rasterization loop
Mikko Ronkainen
mikoro at iki.fi
Thu Nov 21 18:24:54 PST 2013
I'm trying to learn some software rasterization stuff. Here's
what I'm doing:
32-bit DMD on 64-bit Windows.
Framebuffer is an int[]; each int is a pixel in 0xAABBGGRR format
(this seems to be the fastest for my CPU + GPU).
Framebuffer is thrown as-is to OpenGL and rendered as a textured quad.
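In case it matters, the upload is roughly this (a sketch only; it
assumes DerelictGL3-style bindings, an already created and bound
texture, and a little-endian machine, so each 0xAABBGGRR int lays
out as R, G, B, A bytes in memory, matching GL_RGBA +
GL_UNSIGNED_BYTE):

import derelict.opengl3.gl3; // or whichever GL binding you use

// Hypothetical helper, not my exact code: re-uploads the whole
// buffer into an existing texture each frame.
void uploadFramebuffer(int[] data, int width, int height, uint texture)
{
    glBindTexture(GL_TEXTURE_2D, texture);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                    GL_RGBA, GL_UNSIGNED_BYTE, data.ptr);
}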
Here's a simple rectangle drawing algorithm that also does alpha
blending. I tried quite a few variations (for example, without the
byte casting, using ints and shifting instead), but none was as
fast as this:
class Framebuffer
{
    int[] data;
    int width;
    int height;
}

void drawRectangle(Framebuffer framebuffer, int x, int y,
                   int width, int height, int color)
{
    foreach (i; y .. y + height)
    {
        int start = x + i * framebuffer.width;

        foreach (j; 0 .. width)
        {
            // View the destination pixel and source color as raw bytes (R, G, B, A).
            byte* bg = cast(byte*)&framebuffer.data[start + j];
            byte* fg = cast(byte*)&color;

            // Scale alpha to 1 .. 256 so that >> 8 approximates division by 255.
            int alpha = (fg[3] & 0xff) + 1;
            int inverseAlpha = 257 - alpha;

            bg[0] = cast(byte)((alpha * (fg[0] & 0xff) + inverseAlpha * (bg[0] & 0xff)) >> 8);
            bg[1] = cast(byte)((alpha * (fg[1] & 0xff) + inverseAlpha * (bg[1] & 0xff)) >> 8);
            bg[2] = cast(byte)((alpha * (fg[2] & 0xff) + inverseAlpha * (bg[2] & 0xff)) >> 8);
            bg[3] = cast(byte)0xff; // destination ends up fully opaque
        }
    }
}
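If I understand the trick right, the +1 and the 257 - alpha exist so
that >> 8 can stand in for a division by 255. A quick endpoint check
(my own sanity test, not part of the renderer):

// fg alpha = 255 (opaque): alpha = 256, inverseAlpha = 1
assert(((256 * 200 + 1 * 50) >> 8) == 200);  // 51250 >> 8, source wins

// fg alpha = 0 (transparent): alpha = 1, inverseAlpha = 256
assert(((1 * 200 + 256 * 50) >> 8) == 50);   // 13000 >> 8, destination wins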
I would like to make this as fast as possible, since it runs for
almost every pixel, every frame.
Am I doing something stupid that is slowing things down? Cache
thrashing, or even branch mispredictions? :)
Is this kind of algorithm + data even a candidate for SIMD usage?
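For example, could a SWAR-style variant like this be a stepping
stone toward SIMD? (An untested sketch using the same +1 / 257
approximation as above; blendPixel is a hypothetical helper, not
code I'm actually running:)

int blendPixel(int bg, int fg)
{
    // Alpha sits in the top byte of 0xAABBGGRR; scale it to 1 .. 256.
    uint a  = (cast(uint)fg >> 24) + 1;
    uint ia = 257 - a;

    // R and B share one multiply, G gets its own; the fields can't
    // overflow into each other because 257 * 255 still fits in 16 bits.
    uint rb = ((fg & 0x00FF00FF) * a + (bg & 0x00FF00FF) * ia) >> 8;
    uint g  = ((fg & 0x0000FF00) * a + (bg & 0x0000FF00) * ia) >> 8;

    return cast(int)((rb & 0x00FF00FF) | (g & 0x0000FF00) | 0xFF000000);
}

It would halve the multiplies per pixel and read whole ints instead
of single bytes, which I'd guess also maps better to vector registers.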
Even though fg is of type byte*, fg[0] can evaluate to a value
outside 0 .. 0xff. It needs to be (fg[0] & 0xff) to make things
work. I wonder why?
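A minimal standalone snippet of the behaviour I mean:

import std.stdio;

void main()
{
    byte b = cast(byte)0xff;  // bit pattern 1111_1111
    int promoted = b;         // ends up as -1 (0xFFFFFFFF), not 255
    int masked = b & 0xff;    // masking gives the expected 255
    writefln("%s %s", promoted, masked);
}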