color lib

Manu via Digitalmars-d digitalmars-d at puremagic.com
Sun Oct 9 06:18:22 PDT 2016


On 9 October 2016 at 15:34, Ilya Yaroshenko via Digitalmars-d
<digitalmars-d at puremagic.com> wrote:
> On Sunday, 9 October 2016 at 05:21:32 UTC, Manu wrote:
>>
>> On 9 October 2016 at 14:03, Nicholas Wilson via Digitalmars-d
>> <digitalmars-d at puremagic.com> wrote:
>>>
>>> [...]
>>
>>
>> Well, the trouble is that the lambda you might give to 'map' won't work
>> anymore. Operators don't work on batches; you need to use a completely
>> different API, and I think that's unfortunate.
>
>
> Could you please give an example of what type of operation should be
> vectorized?

Let's consider a super simple blend:
  dest = src.rgb * src.a + dest.rgb * (1 - src.a);

This is perhaps the most common blend that exists. If this is a
ubyte[4] color, which is the most common format, then to do it
efficiently, runs of 16 colors (4x ubyte[16] vectors) need to be
rearranged into:
  ubyte[16][3] rgb = [ [RGBRGBRGBRGBRGBR],
                       [GBRGBRGBRGBRGBRG],
                       [BRGBRGBRGBRGBRGB] ];
  ubyte[16][3] a   = [ [AAAaaaAAAaaaAAAa],
                       [aaAAAaaaAAAaaaAA],
                       [AaaaAAAaaaAAAaaa] ];
You can do this with gather loads, or with a couple of shuffles after
loading. Then obviously do the work in this configuration.

Or you might expand it to [ [RRRRRRRRRRRRRRRR], [GGGGGGGGGGGGGGGG],
[BBBBBBBBBBBBBBBB] ], etc.; it depends on the work, and on which
expansion is cheaper for the platform (i.e., shuffling limitations).

Now, this might not look like much of a win for this blend, but as you
extend the sequence of ops, the win gets much, much bigger.
Particularly so if you want to do gamma-correct stuff, which would
usually involve expanding those ubytes into floats, then doing vector
pows and stuff like that. Either way, you need to iterate the image 4
vectors at a time.
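
As a minimal sketch of that float path (assuming a plain 2.2 gamma
rather than the exact sRGB transfer function, and with hypothetical
helper names), the per-plane decode/encode would be something like:

  import std.math;

  // Gamma-decode one 16-wide plane of 8-bit values to linear floats.
  float[16] toLinear(const ubyte[16] encoded)
  {
      float[16] linear;
      foreach (i, c; encoded)
          linear[i] = (c / 255.0f) ^^ 2.2f;
      return linear;
  }

  // Gamma-encode a 16-wide plane of linear floats back to 8-bit.
  ubyte[16] toEncoded(const float[16] linear)
  {
      ubyte[16] encoded;
      foreach (i, c; linear)
          encoded[i] = cast(ubyte)((c ^^ (1 / 2.2f)) * 255.0f + 0.5f);
      return encoded;
  }

Those ^^ calls are exactly the kind of per-element work that wants to
run 4/8/16 floats wide, which is why the batch has to stay in planar
form through the whole chain of ops.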

That's the sort of batching I'm talking about. Trouble is, this work
needs to be wrapped into a function that receives inputs in batches,
like:
  RGBA8[16] doBulkBlend(RGBA8[16] buffer) { ... bulk blend code ... }
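
A minimal sketch of what might go inside such a function, in plain D
rather than explicit SIMD (RGBA8 and the two-argument signature that
takes the destination batch explicitly are assumptions for the
example): it deinterleaves the 16-pixel batch into planar form, blends
one channel plane at a time, and packs the result back. A real
implementation would replace the scalar transpose with gather loads or
shuffles and keep the planes in vector registers.

  struct RGBA8 { ubyte r, g, b, a; }

  RGBA8[16] doBulkBlend(const RGBA8[16] src, const RGBA8[16] dst)
  {
      // Expand to [RRRR...][GGGG...][BBBB...][AAAA...] planes; ushort
      // avoids overflow in the 8-bit multiplies.
      ushort[16][4] s, d;
      foreach (i; 0 .. 16)
      {
          s[0][i] = src[i].r;  s[1][i] = src[i].g;
          s[2][i] = src[i].b;  s[3][i] = src[i].a;
          d[0][i] = dst[i].r;  d[1][i] = dst[i].g;
          d[2][i] = dst[i].b;  d[3][i] = dst[i].a;
      }

      // dest.rgb = (src.rgb * src.a + dest.rgb * (255 - src.a)) / 255,
      // one 16-wide plane at a time: straight-line loops a compiler can
      // vectorise.
      ushort[16][3] blended;
      foreach (c; 0 .. 3)
          foreach (i; 0 .. 16)
              blended[c][i] = cast(ushort)(
                  (s[c][i] * s[3][i] + d[c][i] * (255 - s[3][i])) / 255);

      // Re-interleave into packed pixels; alpha isn't specified by the
      // blend above, so the destination alpha is carried through.
      RGBA8[16] result;
      foreach (i; 0 .. 16)
          result[i] = RGBA8(cast(ubyte)blended[0][i],
                            cast(ubyte)blended[1][i],
                            cast(ubyte)blended[2][i],
                            dst[i].a);
      return result;
  }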

This sort of thing:
  buffer.map!(e => e.rgb * e.a).copy(output);
Super readable! It would be really nice to express, but I have no idea
how we can make that sort of thing efficient.

You could start writing this sort of thing:
  buffer.chunksOf!16.map!(e => doBulkBlend(e[0..16])).deChunk.copy(output);
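
As a rough approximation with today's Phobos (assuming std.range.chunks
stands in for the hypothetical chunksOf, joiner for deChunk, and a
pass-through stub for the kernel so the plumbing compiles):

  import std.algorithm : copy, joiner, map;
  import std.range : chunks;

  struct RGBA8 { ubyte r, g, b, a; }

  // Stub standing in for the batch kernel sketched above.
  RGBA8[] doBulkBlend(RGBA8[] batch) { return batch; }

  void main()
  {
      auto buffer = new RGBA8[64];
      auto output = new RGBA8[64];

      buffer.chunks(16)       // plays the role of chunksOf!16
            .map!doBulkBlend
            .joiner           // plays the role of deChunk
            .copy(output);
  }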

Yeah, it's code... it would compile, but I consider that to be
completely obfuscated. You can't look at that and understand anything
much about what it does... so I don't think that's a good goal-post at
all.
If I showed that to a colleague, I don't think they'd be impressed. We
can't reach that point and say D is awesome for data-stream
processing... we need to go a lot further than that.

Anyway, I think this sort of thing is a minimum target. I'd like to
see how this sort of batching would integrate into ndslice nicely,
because it introduces the 'nd' iteration element... there's a heap of
challenges: element alignment, unaligned line strides, mid-vector
slices, etc.
I can imagine certain filter algorithms that work on 2d slices
('blocks') rather than 1d slices like the example above. What if the
image is rotated or transposed? How can applying a per-pixel operation
to a buffer iterate the memory in a linear fashion even though it's
working in batches of elements?

I haven't sat and tried plugging this into ndslice much yet. Haven't
had enough time, and I really wanted to get colour to a point I'm
happy with.

