Taking pipeline processing to the next level

Wed Sep 7 03:31:13 PDT 2016

On Wednesday, 7 September 2016 at 02:09:17 UTC, Manu wrote:
>> The lesson I learned from this is that you need the user code 
>> to provide a lot of extra information about the algorithm at 
>> compile time for the templates to work out a way to fuse 
>> pipeline stages together efficiently.
>>
>> I believe it is possible to get something similar in D because 
>> D has more powerful templates than C++ and D also has some 
>> type introspection which C++ lacks.  Unfortunately I'm not as 
>> good on D so I can only provide some ideas rather than actual 
>> working code.
>>
>> Once this problem is solved, the benefit is huge.  It allowed 
>> me to perform high level optimizations (streaming load/save, 
>> prefetching, dynamic dispatching depending on data alignment 
>> etc.) in the main loop which automatically benefits all 
>> kernels and pipelines.
>
> Exactly!

I think the problem here is twofold.

First question: how do we combine pipeline stages with minimal 
overhead?

I think the key to this problem is reliable *forceinline*.

For example, take a pipeline like this:

input.map!(x=>x.f1().f2().f3().store(output));

If we could make sure f1(), f2(), f3(), store(), and map() itself 
are all inlined, we would end up with a single loop with no 
function calls, and the compiler would be free to perform 
cross-function optimizations. This is about as good as you can 
get. Unfortunately, from what I hear, it is currently difficult 
to make sure D functions get inlined.
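
To make the idea concrete, here is a minimal sketch of what I have 
in mind. The bodies of f1, f2, f3 and store are made up for 
illustration (the real kernels would be whatever the pipeline 
needs), and I use a plain foreach instead of map!() so the example 
actually runs the stages. pragma(inline, true) is the closest 
thing to forceinline that I know of in D; how reliably each 
compiler honours it is something I have not verified.

```d
// Hypothetical stages: the names follow the pipeline above, the bodies
// are placeholders chosen only so the result is easy to check.
pragma(inline, true) float f1(float x) { return x * 2.0f; }
pragma(inline, true) float f2(float x) { return x + 1.0f; }
pragma(inline, true) float f3(float x) { return x * x; }

// Hypothetical final stage: writes one result into its output slot.
pragma(inline, true) void store(float x, ref float slot) { slot = x; }

float[] runPipeline(const(float)[] input)
{
    auto output = new float[input.length];
    // If every stage is inlined, this is a single loop with no calls.
    foreach (i, x; input)
        x.f1().f2().f3().store(output[i]);
    return output;
}

void main()
{
    auto result = runPipeline([1.0f, 2.0f, 3.0f]);
    assert(result == [9.0f, 25.0f, 49.0f]);
}
```

Whether the stages really end up inlined still has to be checked 
in the generated code; the pragma only expresses the request.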

Second question: how do we combine SIMD pipeline stages with 
minimal overhead?

Besides reliable inlining, we also need some template code to 
repeat stages until their strides match. This requires knowing 
each stage's logical unit size and its input/output types and 
sizes at compile time. I can't think of what the interface for 
this would look like, but the current map!() is likely 
insufficient to support it.
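
As a very rough sketch of the stride matching part only, and not a 
proposal for the real interface: assume each stage is a struct 
that exposes its logical unit size as an enum named stride and a 
static run function. Those names, and Scale2, Offset4, 
commonStride and fused, are all invented for this example. To keep 
it short, every stage here takes and produces the same number of 
floats, which real SIMD kernels will not always do.

```d
// Hypothetical stage that works on 2 floats per call.
struct Scale2
{
    enum stride = 2;
    static void run(const(float)[] src, float[] dst)
    {
        foreach (i; 0 .. stride) dst[i] = src[i] * 2.0f;
    }
}

// Hypothetical stage that works on 4 floats per call.
struct Offset4
{
    enum stride = 4;
    static void run(const(float)[] src, float[] dst)
    {
        foreach (i; 0 .. stride) dst[i] = src[i] + 1.0f;
    }
}

// Small helpers, evaluated at compile time when used in an enum.
size_t gcd(size_t a, size_t b)
{
    while (b) { auto t = a % b; a = b; b = t; }
    return a;
}
size_t lcm(size_t a, size_t b) { return a / gcd(a, b) * b; }

// Least common multiple of all stage strides.
template commonStride(Stages...)
{
    static if (Stages.length == 0)
        enum size_t commonStride = 1;
    else
        enum size_t commonStride =
            lcm(Stages[0].stride, commonStride!(Stages[1 .. $]));
}

void fused(Stages...)(const(float)[] input, float[] output)
{
    enum block = commonStride!Stages;
    assert(input.length == output.length && input.length % block == 0);
    float[block] a, b;
    for (size_t pos = 0; pos < input.length; pos += block)
    {
        a[] = input[pos .. pos + block];
        foreach (S; Stages) // unrolled at compile time over the stage list
        {
            // Repeat the stage until it has covered one common block.
            static foreach (r; 0 .. block / S.stride)
                S.run(a[r * S.stride .. (r + 1) * S.stride],
                      b[r * S.stride .. (r + 1) * S.stride]);
            a = b;
        }
        output[pos .. pos + block] = a[];
    }
}

void main()
{
    float[] input = [1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f];
    auto output = new float[input.length];
    fused!(Scale2, Offset4)(input, output);
    assert(output == [3.0f, 5.0f, 7.0f, 9.0f, 11.0f, 13.0f, 15.0f, 17.0f]);
}
```

Here Scale2 runs twice and Offset4 once per iteration of the main 
loop, because the common stride is 4. Stages whose output size 
differs from their input size would need more information than a 
single stride value, which is the part I don't have a good 
interface for.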

I still don't believe auto-selecting between scalar and vector 
paths would be a very useful feature. Normally I would only 
consider a SIMD solution when I know in advance that the code is a 
performance hotspot. When the amount of data is small, I simply 
don't care about performance and would just choose whichever way 
is simplest, like map!(), because the performance impact is not 
noticeable and definitely not worth the increased complexity.