Taking pipeline processing to the next level

Tue Sep 6 18:04:06 PDT 2016

On Wednesday, 7 September 2016 at 00:21:23 UTC, Manu wrote:
>> The end of a scan line is special-cased. If I need 12 pixels 
>> for the last iteration but there are only 8 left, an instance 
>> of Kernel::InputVector is allocated on the stack, the 8 remaining 
>> pixels are memcpy'd into it, and it is then sent to the kernel. 
>> The kernel's output is also assigned to a stack variable first, 
>> and then 8 pixels are memcpy'd to the output buffer.
>
> Right, and this is a classic problem with this sort of 
> function; it is only more efficient if numElements is suitably 
> long. See, I often wonder if it would be worth being able to 
> provide both functions, a scalar and array version, and have 
> the algorithms select between them intelligently.

We normally process full HD or higher-resolution images, so the 
overhead of having to copy the last iteration was negligible.
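
For readers who have not seen the tail handling quoted above, here is a minimal sketch of the idea, not the actual framework code. InvertKernel, kWidth, run and processScanline are hypothetical names made up for illustration; only Kernel::InputVector comes from the description above, and the "kernel" is a plain loop standing in for real SIMD code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical kernel that always consumes and produces kWidth pixels.
struct InvertKernel {
    static constexpr std::size_t kWidth = 12;
    struct InputVector  { std::uint8_t px[kWidth]; };
    struct OutputVector { std::uint8_t px[kWidth]; };

    static OutputVector run(const InputVector& in) {
        OutputVector out;
        for (std::size_t i = 0; i < kWidth; ++i)
            out.px[i] = static_cast<std::uint8_t>(255 - in.px[i]);
        return out;
    }
};

template <class Kernel>
void processScanline(const std::uint8_t* src, std::uint8_t* dst, std::size_t n) {
    constexpr std::size_t W = Kernel::kWidth;
    std::size_t i = 0;

    // Full iterations: W pixels in, W pixels out.
    for (; i + W <= n; i += W) {
        typename Kernel::InputVector in;
        std::memcpy(in.px, src + i, W);
        typename Kernel::OutputVector out = Kernel::run(in);
        std::memcpy(dst + i, out.px, W);
    }

    // Special-cased end of the scan line: fewer than W pixels remain.
    if (i < n) {
        const std::size_t rem = n - i;
        assert(rem < W);
        typename Kernel::InputVector in{};      // stack temporary, zero-padded
        std::memcpy(in.px, src + i, rem);       // copy only what is left
        typename Kernel::OutputVector out = Kernel::run(in);
        std::memcpy(dst + i, out.px, rem);      // never write past the buffer
    }
}
```

With n = 20 this runs one full 12-pixel iteration and then the 8-pixel tail through the stack temporaries, so the two extra copies happen once per scan line regardless of its width.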

It was fairly easy to put together a scalar version, as they are 
much easier to write than the SIMD ones. In fact, I had a scalar 
version of every SIMD kernel and used them for unit testing.
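
As an illustration of using the scalar version as a test oracle, something along these lines would do. This is a sketch under assumptions: brightenScalar, brightenWide, KernelFn and kernelsAgree are hypothetical names, and brightenWide is a 4-wide unrolled loop standing in for a real SIMD kernel so that the example stays portable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference: one pixel at a time, easy to check by eye.
void brightenScalar(const std::uint8_t* src, std::uint8_t* dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        const unsigned v = src[i] + 16u;
        dst[i] = static_cast<std::uint8_t>(v > 255u ? 255u : v);  // saturate
    }
}

// Stand-in for the SIMD kernel: 4 pixels per iteration, scalar tail.
void brightenWide(const std::uint8_t* src, std::uint8_t* dst, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        for (std::size_t lane = 0; lane < 4; ++lane) {
            const unsigned v = src[i + lane] + 16u;
            dst[i + lane] = static_cast<std::uint8_t>(v > 255u ? 255u : v);
        }
    }
    brightenScalar(src + i, dst + i, n - i);
}

using KernelFn = void (*)(const std::uint8_t*, std::uint8_t*, std::size_t);

// True if both kernels give identical output for every length 0..maxLen.
// Sweeping the length exercises the tail path as well as the main loop.
bool kernelsAgree(KernelFn reference, KernelFn candidate, std::size_t maxLen) {
    for (std::size_t n = 0; n <= maxLen; ++n) {
        std::vector<std::uint8_t> src(n), expected(n, 0), actual(n, 0);
        for (std::size_t i = 0; i < n; ++i)
            src[i] = static_cast<std::uint8_t>(i * 37u + n);  // deterministic pattern
        reference(src.data(), expected.data(), n);
        candidate(src.data(), actual.data(), n);
        if (expected != actual) return false;
    }
    return true;
}

void testBrighten() {
    assert(kernelsAgree(brightenScalar, brightenWide, 64));
}
```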

It shouldn't be hard to have the framework look at the buffer 
size and choose the scalar version when the number of elements is 
small. It wasn't done that way simply because we didn't need it.
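
Roughly, such a size-based selection could look like the sketch below. None of this existed in the framework: Path, choosePath, scale and the threshold kMinVectorElements are hypothetical, the value 64 is an arbitrary placeholder that would have to be tuned by measurement, and scaleVector is an unrolled loop standing in for a real SIMD kernel.

```cpp
#include <cassert>
#include <cstddef>

enum class Path { Scalar, Vector };

// Assumed cut-over point; a real value would come from benchmarking.
constexpr std::size_t kMinVectorElements = 64;

constexpr Path choosePath(std::size_t numElements) {
    return numElements < kMinVectorElements ? Path::Scalar : Path::Vector;
}

void scaleScalar(const float* src, float* dst, std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i) dst[i] = src[i] * k;
}

// Stand-in for the SIMD version: 4 elements per iteration, scalar tail.
void scaleVector(const float* src, float* dst, std::size_t n, float k) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i]     = src[i]     * k;
        dst[i + 1] = src[i + 1] * k;
        dst[i + 2] = src[i + 2] * k;
        dst[i + 3] = src[i + 3] * k;
    }
    scaleScalar(src + i, dst + i, n - i, k);
}

// Picks a version from the buffer size and reports which one ran.
Path scale(const float* src, float* dst, std::size_t n, float k) {
    assert(n == 0 || (src != nullptr && dst != nullptr));
    const Path path = choosePath(n);
    if (path == Path::Scalar) scaleScalar(src, dst, n, k);
    else                      scaleVector(src, dst, n, k);
    return path;
}
```

Both versions must produce the same results for the selection to be invisible to callers, which is one more reason to keep a scalar version of every kernel around.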