Taking pipeline processing to the next level
finalpatch via Digitalmars-d
digitalmars-d at puremagic.com
Tue Sep 6 07:21:01 PDT 2016
On Tuesday, 6 September 2016 at 03:08:43 UTC, Manu wrote:
> I still stand by this, and I listed some reasons above.
> Auto-vectorisation is a nice opportunistic optimisation, but it can't
> be relied on. The key reason is that scalar arithmetic semantics are
> different than vector semantics, and auto-vectorisation tends to
> produce a whole bunch of extra junk code to carefully (usually
> pointlessly) preserve the scalar semantics that it's trying to
> vectorise. This will never end well.
>
> But the vectorisation isn't the interesting problem here, I'm really
> just interested in how to work these batch-processing functions into
> our nice modern pipeline statements without placing an unreasonable
> burden on the end-user, who shouldn't be expected to go out of their
> way. If they even have to start manually chunking, I think we've
> already lost; they won't know optimal chunk-sizes, or anything about
> alignment boundaries, cache, etc.
In a previous job I successfully created a small C++ library to
perform pipelined SIMD image processing. Not sure how relevant it is,
but I thought I'd share the design here; perhaps it'll give you guys
some ideas.

Basically, users of this library only need to write simple kernel
classes, something like this:
// A kernel that processes 4 pixels at a time
struct MySimpleKernel : Kernel<4>
{
    // Tell the library the input and output types
    using InputVector = Vector<__m128, 1>;
    using OutputVector = Vector<__m128, 2>;

    template<typename T>
    OutputVector apply(const T& src)
    {
        // T will be deduced to Vector<__m128, 1>,
        // which is an array of one __m128 element.
        // Awesome SIMD code goes here...
        // ...and return the output vector.
        return OutputVector(...);
    }
};
Of course, the InputVector and OutputVector element types do not have
to be __m128; they can be other types such as int or float.
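The Kernel and Vector types themselves aren't shown in the post. Here is a minimal sketch of what that scaffolding might look like, using plain float elements instead of __m128 so it stays portable; the names and shapes (including the SplitKernel example) are assumptions for illustration, not the original library's API:

```cpp
#include <array>
#include <cstddef>

// Hypothetical scaffolding: Vector<T, N> is a fixed-size group of N
// elements, and Kernel<W> records the kernel's pixel width at compile
// time. (With __m128, one element would hold 4 pixels; a float lane is
// used here purely to keep the sketch portable.)
template <typename T, std::size_t N>
struct Vector {
    std::array<T, N> elems{};
    T&       operator[](std::size_t i)       { return elems[i]; }
    const T& operator[](std::size_t i) const { return elems[i]; }
};

template <std::size_t W>
struct Kernel {
    static constexpr std::size_t pixel_width = W;
};

// A scalar kernel in the same shape as MySimpleKernel above:
// one float element in, two float elements out.
struct SplitKernel : Kernel<4> {
    using InputVector  = Vector<float, 1>;
    using OutputVector = Vector<float, 2>;

    template <typename T>
    OutputVector apply(const T& src) const {
        OutputVector out;
        out[0] = src[0] * 2.0f;  // first output element: doubled
        out[1] = src[0] + 1.0f;  // second output element: offset by one
        return out;
    }
};
```

The only thing the library needs from a kernel is its width (via the Kernel<W> base) and the InputVector/OutputVector aliases, which is what makes the later fusion step possible.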
The cool thing is that kernels can be chained together with the >>
operator.
So assume we have another kernel:

struct AnotherKernel : Kernel<3>
{
    ...
};
Then we can create a processing pipeline with these two kernels:

InputBuffer(...) >> MySimpleKernel() >> AnotherKernel() >> OutputBuffer(...);
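The post doesn't show how operator>> is implemented. One plausible sketch (the Chain name and toy stages are hypothetical, not the original library's) is an expression-template-style node that records both stages, so the finished pipeline is a single nested type the compiler can inline into one loop:

```cpp
// Hypothetical sketch: each use of >> wraps the two sides in a Chain,
// so A >> B >> C builds the type Chain<Chain<A, B>, C>.
template <typename First, typename Second>
struct Chain {
    First  first;
    Second second;

    template <typename In>
    auto apply(const In& in) const {
        // Feed the first stage's output into the second stage.
        return second.apply(first.apply(in));
    }
};

template <typename First, typename Second>
Chain<First, Second> operator>>(First f, Second s) {
    return Chain<First, Second>{f, s};
}

// Two toy scalar stages to demonstrate the chaining.
struct Add1   { int apply(int x) const { return x + 1; } };
struct Times2 { int apply(int x) const { return x * 2; } };
```

With these definitions, `(Add1{} >> Times2{}).apply(3)` evaluates `(3 + 1) * 2 = 8`; the real library would additionally reconcile the two kernels' pixel widths at this point, as described next.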
Some template magic then figures out that the LCM of the two kernels'
pixel widths is lcm(4, 3) = 12, so they are fused into a composite
kernel with a pixel width of 12. The line above compiles down into a
single function invocation, whose main loop reads the source buffer in
4-pixel steps, calls MySimpleKernel 3 times, then calls AnotherKernel
4 times on the 12 intermediate pixels.
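The width arithmetic described above is easy to check with a few lines of compile-time code. This is a sketch assuming C++17 (for std::lcm); FusedWidth is a hypothetical name, not the library's:

```cpp
#include <cstddef>
#include <numeric>  // std::lcm, C++17

// The fused kernel's width is the LCM of the two pixel widths; each
// sub-kernel then runs (fused width / its own width) times per
// main-loop iteration.
template <std::size_t WidthA, std::size_t WidthB>
struct FusedWidth {
    static constexpr std::size_t value   = std::lcm(WidthA, WidthB);
    static constexpr std::size_t calls_a = value / WidthA;
    static constexpr std::size_t calls_b = value / WidthB;
};

// For Kernel<4> followed by Kernel<3>: fused width 12, with 3 calls
// of the first kernel and 4 calls of the second per loop iteration.
static_assert(FusedWidth<4, 3>::value   == 12, "fused width");
static_assert(FusedWidth<4, 3>::calls_a == 3,  "MySimpleKernel calls");
static_assert(FusedWidth<4, 3>::calls_b == 4,  "AnotherKernel calls");
```

Note that the LCM only equals the product when the widths are coprime; for widths 4 and 6 the fused width would be 12, not 24, with 3 and 2 calls respectively.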
Any number of kernels can be chained together in this way, as long as
your compiler doesn't explode. At the time, my benchmarks showed that
pipelines generated this way often rivalled the speed of hand-tuned
loops.