Optimization tips for alpha blending / rasterization loop

Fri Nov 22 06:55:52 PST 2013

> Do you want to use a ubyte instead of a byte here?

Yes, that was a silly mistake. Fixing it removed the need for 
all the masking operations, which gave the biggest speedup.

> Also, for your alpha channel:
>
> int alpha = (fg[3] & 0xff) + 1;
> int inverseAlpha = 257 - alpha;
>
> If fg[3] = 0 then inverseAlpha = 256, which is out of the range
> that can be stored in a ubyte.

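I think my logic is correct. The calculations are done with 
32-bit integers, and only the final result is cast to a ubyte. 
The two weights always sum to 257, so the largest possible 
value is (257 * 255) >> 8 = 255 and nothing needs clamping. The 
reason for the +1 is the >> 8, which divides by 256 rather than 
255.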
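
To convince myself, here is a minimal brute-force sketch of that 
claim. The helper name blendChannel is my own and only mirrors the 
arithmetic of the loop; it is not part of the drawing code.

```d
// Hypothetical helper (the name is mine): blends one channel with the
// same weights as the loop in drawRectangle.
uint blendChannel(uint a, uint fg, uint bg) pure nothrow @nogc @safe
{
    immutable uint alpha = a + 1;           // 1 .. 256
    immutable uint invAlpha = 257 - alpha;  // 256 .. 1
    return (alpha * fg + invAlpha * bg) >> 8;
}

// Brute-force check over every alpha / foreground / background value.
unittest
{
    foreach (uint a; 0 .. 256)
        foreach (uint fg; 0 .. 256)
            foreach (uint bg; 0 .. 256)
                assert(blendChannel(a, fg, bg) <= 255);

    foreach (uint fg; 0 .. 256)
        foreach (uint bg; 0 .. 256)
        {
            assert(blendChannel(0, fg, bg) == bg);    // transparent: background unchanged
            assert(blendChannel(255, fg, bg) == fg);  // opaque: exactly the foreground
        }
}
```

Compiling with dmd -unittest -main runs the check. The two end 
cases are the reason I prefer the +1 form: alpha 0 leaves the 
background untouched and alpha 255 gives exactly the foreground.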

class Framebuffer
{
   uint[] data;   // one 32-bit pixel per element, row-major
   uint width;
   uint height;
}

void drawRectangle(Framebuffer framebuffer, uint x, uint y,
                   uint width, uint height, uint color)
{
   // View the color as four bytes. On a little-endian machine
   // fg[3] is the top byte, which holds alpha.
   const(ubyte)* fg = cast(const(ubyte)*)&color;
   immutable uint alpha = fg[3] + 1;
   immutable uint invAlpha = 257 - alpha;
   immutable uint afg0 = alpha * fg[0];
   immutable uint afg1 = alpha * fg[1];
   immutable uint afg2 = alpha * fg[2];

   foreach (i; y .. y + height)
   {
     uint start = x + i * framebuffer.width;

     foreach(j; 0 .. width)
     {
       ubyte* bg = cast(ubyte*)(&framebuffer.data[start + j]);

       bg[0] = cast(ubyte)((afg0 + invAlpha * bg[0]) >> 8);
       bg[1] = cast(ubyte)((afg1 + invAlpha * bg[1]) >> 8);
       bg[2] = cast(ubyte)((afg2 + invAlpha * bg[2]) >> 8);
       bg[3] = 0xff;
     }
   }
}

Can this be made faster with SIMD? (I don't know much about it; 
maybe the data and algorithm don't fit it?)
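
The closest thing I can sketch myself is the "SIMD within a 
register" trick, which uses plain uint arithmetic and no real vector 
instructions. It blends bytes 0 and 2 in one multiply and byte 1 in 
another. The function name blendPixel is hypothetical, and I am 
assuming alpha sits in the top byte of the uint, as fg[3] does on a 
little-endian machine. I have not measured whether it is faster.

```d
// Hypothetical helper, not part of the code above: blends one 32-bit
// pixel with the same +1 / 257 - alpha weights, two channels per multiply.
// Assumption: alpha is the top byte of fg (fg[3] on little-endian).
uint blendPixel(uint fg, uint bg) pure nothrow @nogc @safe
{
    immutable uint alpha = (fg >> 24) + 1;   // 1 .. 256
    immutable uint invAlpha = 257 - alpha;   // 256 .. 1

    // Bytes 0 and 2 share one register. Each 16-bit lane holds at most
    // 257 * 255 = 65535, so the lanes never carry into each other.
    immutable uint rb = ((fg & 0x00FF00FF) * alpha
                       + (bg & 0x00FF00FF) * invAlpha) >> 8;

    // Byte 1 is blended on its own; its sum occupies bits 8 .. 23.
    immutable uint g = ((fg & 0x0000FF00) * alpha
                      + (bg & 0x0000FF00) * invAlpha) >> 8;

    // Mask off the leftover low bits and force the result opaque,
    // like bg[3] = 0xff in the loop.
    return 0xFF000000 | (rb & 0x00FF00FF) | (g & 0x0000FF00);
}
```

If this holds up, the inner loop body would reduce to something like 
framebuffer.data[start + j] = blendPixel(color, framebuffer.data[start + j]), 
with alpha and invAlpha hoisted out of the loop as before.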

Can this be parallelized with any real gains?