Taking pipeline processing to the next level
finalpatch via Digitalmars-d
digitalmars-d at puremagic.com
Tue Sep 6 07:21:01 PDT 2016
On Tuesday, 6 September 2016 at 03:08:43 UTC, Manu wrote:
> I still stand by this, and I listed some reasons above.
> Auto-vectorisation is a nice opportunistic optimisation, but it can't
> be relied on. The key reason is that scalar arithmetic semantics are
> different than vector semantics, and auto-vectorisation tends to
> produce a whole bunch of extra junk code to carefully (usually
> pointlessly) preserve the scalar semantics that it's trying to
> vectorise. This will never end well.
>
> But the vectorisation isn't the interesting problem here, I'm really
> just interested in how to work these batch-processing functions into
> our nice modern pipeline statements without placing an unreasonable
> burden on the end-user, who shouldn't be expected to go out of their
> way. If they even have to start manually chunking, I think we've
> already lost; they won't know optimal chunk-sizes, or anything about
> alignment boundaries, cache, etc.
In a previous job I successfully created a small C++ library to
perform pipelined SIMD image processing. Not sure how relevant it is,
but I thought I'd share the design here; perhaps it'll give you guys
some ideas.

Basically, users of this library only need to write simple kernel
classes, something like this:
// A kernel that processes 4 pixels at a time
struct MySimpleKernel : Kernel<4>
{
    // Tell the library the input and output types
    using InputVector = Vector<__m128, 1>;
    using OutputVector = Vector<__m128, 2>;

    template<typename T>
    OutputVector apply(const T& src)
    {
        // T will be deduced to Vector<__m128, 1>,
        // which is an array of one __m128 element.
        // Awesome SIMD code goes here...
        // ...and return the output vector.
        return OutputVector(...);
    }
};
Of course, the InputVector and OutputVector element types do not have
to be __m128; they can be other types such as int or float.
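The Kernel and Vector types themselves aren't shown in the post. Here is a minimal sketch of what that scaffolding might look like, using plain float elements instead of __m128 so it stays portable; the names and shapes (including the SplitKernel example) are assumptions for illustration, not the original library's API:

```cpp
#include <array>
#include <cstddef>

// Hypothetical scaffolding: Vector<T, N> is a fixed-size group of N
// elements, and Kernel<W> records the kernel's pixel width at compile
// time. (With __m128, one element would hold 4 pixels; a float lane is
// used here purely to keep the sketch portable.)
template <typename T, std::size_t N>
struct Vector {
    std::array<T, N> elems{};
    T&       operator[](std::size_t i)       { return elems[i]; }
    const T& operator[](std::size_t i) const { return elems[i]; }
};

template <std::size_t W>
struct Kernel {
    static constexpr std::size_t pixel_width = W;
};

// A scalar kernel in the same shape as MySimpleKernel above:
// one float element in, two float elements out.
struct SplitKernel : Kernel<4> {
    using InputVector  = Vector<float, 1>;
    using OutputVector = Vector<float, 2>;

    template <typename T>
    OutputVector apply(const T& src) const {
        OutputVector out;
        out[0] = src[0] * 2.0f;  // first output element: doubled
        out[1] = src[0] + 1.0f;  // second output element: offset by one
        return out;
    }
};
```

The only thing the library needs from a kernel is its width (via the Kernel<W> base) and the InputVector/OutputVector aliases, which is what makes the later fusion step possible.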
The cool thing is that kernels can be chained together with the >>
operator.
So assume we have another kernel:

struct AnotherKernel : Kernel<3>
{
    ...
};
Then we can create a processing pipeline with these two kernels:

InputBuffer(...) >> MySimpleKernel() >> AnotherKernel() >> OutputBuffer(...);
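The post doesn't show how operator>> is implemented. One plausible sketch (the Chain name and toy stages are hypothetical, not the original library's) is an expression-template-style node that records both stages, so the finished pipeline is a single nested type the compiler can inline into one loop:

```cpp
// Hypothetical sketch: each use of >> wraps the two sides in a Chain,
// so A >> B >> C builds the type Chain<Chain<A, B>, C>.
template <typename First, typename Second>
struct Chain {
    First  first;
    Second second;

    template <typename In>
    auto apply(const In& in) const {
        // Feed the first stage's output into the second stage.
        return second.apply(first.apply(in));
    }
};

template <typename First, typename Second>
Chain<First, Second> operator>>(First f, Second s) {
    return Chain<First, Second>{f, s};
}

// Two toy scalar stages to demonstrate the chaining.
struct Add1   { int apply(int x) const { return x + 1; } };
struct Times2 { int apply(int x) const { return x * 2; } };
```

With these definitions, `(Add1{} >> Times2{}).apply(3)` evaluates `(3 + 1) * 2 = 8`; the real library would additionally reconcile the two kernels' pixel widths at this point, as described next.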
Some template magic then figures out that the LCM of the two kernels'
pixel widths is lcm(4, 3) = 12, so they are fused into a composite
kernel with a pixel width of 12. The line above compiles down into a
single function invocation, whose main loop reads the source buffer in
4-pixel steps, calls MySimpleKernel 3 times, then calls AnotherKernel
4 times on the 12 intermediate pixels.
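The width arithmetic described above is easy to check with a few lines of compile-time code. This is a sketch assuming C++17 (for std::lcm); FusedWidth is a hypothetical name, not the library's:

```cpp
#include <cstddef>
#include <numeric>  // std::lcm, C++17

// The fused kernel's width is the LCM of the two pixel widths; each
// sub-kernel then runs (fused width / its own width) times per
// main-loop iteration.
template <std::size_t WidthA, std::size_t WidthB>
struct FusedWidth {
    static constexpr std::size_t value   = std::lcm(WidthA, WidthB);
    static constexpr std::size_t calls_a = value / WidthA;
    static constexpr std::size_t calls_b = value / WidthB;
};

// For Kernel<4> followed by Kernel<3>: fused width 12, with 3 calls
// of the first kernel and 4 calls of the second per loop iteration.
static_assert(FusedWidth<4, 3>::value   == 12, "fused width");
static_assert(FusedWidth<4, 3>::calls_a == 3,  "MySimpleKernel calls");
static_assert(FusedWidth<4, 3>::calls_b == 4,  "AnotherKernel calls");
```

Note that the LCM only equals the product when the widths are coprime; for widths 4 and 6 the fused width would be 12, not 24, with 3 and 2 calls respectively.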
Any number of kernels can be chained together in this way, as long as
your compiler doesn't explode. At the time, my benchmarks showed that
pipelines generated this way often rivalled the speed of hand-tuned
loops.