Taking pipeline processing to the next level
finalpatch via Digitalmars-d
digitalmars-d at puremagic.com
Tue Sep 6 18:04:06 PDT 2016
On Wednesday, 7 September 2016 at 00:21:23 UTC, Manu wrote:
>> The end of a scan line is special-cased. If I need 12 pixels
>> for the last iteration but there are only 8 left, an instance
>> of Kernel::InputVector is allocated on the stack, the 8 remaining
>> pixels are memcpy'd into it, and it is then sent to the kernel.
>> Output from the kernel is also assigned to a stack variable first,
>> then 8 pixels are memcpy'd to the output buffer.
> Right, and this is a classic problem with this sort of
> function; it is only more efficient if numElements is
> suitably long.
> See, I often wonder if it would be worth being able to provide
> both functions, a scalar and array version, and have the
> algorithms select between them intelligently.
We normally process full HD or higher-resolution images, so the
overhead of having to copy the last iteration was negligible.
It was fairly easy to put together scalar versions, as they are
much easier to write than the SIMD ones. In fact, I had a scalar
version for every SIMD kernel and used them for unit testing.
It shouldn't be hard to have the framework look at the buffer
size and choose the scalar version when the number of elements is
small; it wasn't done that way simply because we didn't need it.
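
To make the idea concrete, here is a rough sketch of what the tail handling
plus a size-based scalar fallback could look like. This is not the actual
framework code. The Kernel struct, its Width of 4, the scalar() helper, the
processScanline() name and the kScalarThreshold value are all made up for
illustration, and the "SIMD" kernel is plain C++ so that it compiles
anywhere. For brevity, full blocks also go through a stack copy here, which
a real implementation would likely avoid.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical kernel standing in for a real SIMD kernel. It consumes and
// produces fixed-width blocks; the arithmetic is plain C++ for portability.
struct Kernel {
    static constexpr std::size_t Width = 4;  // pixels per iteration (assumed)
    using InputVector  = std::array<float, Width>;
    using OutputVector = std::array<float, Width>;

    // Scalar reference version, also usable to unit-test the block version.
    static float scalar(float p) { return p * 0.5f + 1.0f; }

    // Block version: processes Width pixels per call.
    static OutputVector process(const InputVector& in) {
        OutputVector out{};
        for (std::size_t i = 0; i < Width; ++i) out[i] = scalar(in[i]);
        return out;
    }
};

// Below this element count the scalar path is used (arbitrary assumption).
constexpr std::size_t kScalarThreshold = 2 * Kernel::Width;

void processScanline(const float* in, float* out, std::size_t n) {
    assert(n == 0 || (in != nullptr && out != nullptr));

    // Small buffers: skip the block machinery entirely.
    if (n < kScalarThreshold) {
        for (std::size_t i = 0; i < n; ++i) out[i] = Kernel::scalar(in[i]);
        return;
    }

    const std::size_t blockBytes = Kernel::Width * sizeof(float);
    std::size_t i = 0;

    // Full blocks.
    for (; i + Kernel::Width <= n; i += Kernel::Width) {
        Kernel::InputVector block;
        std::memcpy(block.data(), in + i, blockBytes);
        const Kernel::OutputVector result = Kernel::process(block);
        std::memcpy(out + i, result.data(), blockBytes);
    }

    // Tail: copy the leftover pixels into a zero-padded stack vector, run
    // the kernel once, then copy back only the pixels that really exist.
    const std::size_t remaining = n - i;
    if (remaining > 0) {
        Kernel::InputVector block{};
        std::memcpy(block.data(), in + i, remaining * sizeof(float));
        const Kernel::OutputVector result = Kernel::process(block);
        std::memcpy(out + i, result.data(), remaining * sizeof(float));
    }
}
```

Calling processScanline(in, out, 11) in this sketch runs two full blocks and
one padded tail of 3 pixels, while processScanline(in, out, 3) goes straight
to the scalar loop. Comparing the output against Kernel::scalar() element by
element is essentially how the scalar versions were used in unit tests.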