Taking pipeline processing to the next level

Manu via Digitalmars-d digitalmars-d at puremagic.com
Mon Sep 5 19:56:32 PDT 2016


On 5 September 2016 at 18:21, Andrei Alexandrescu via Digitalmars-d
<digitalmars-d at puremagic.com> wrote:
> On 9/5/16 7:08 AM, Manu via Digitalmars-d wrote:
>>
>> I mostly code like this now:
>>   data.map!(x => transform(x)).copy(output);
>>
>> It's convenient and reads nicely, but it's generally inefficient.
>
>
> What are the benchmarks and the numbers? What loss are you looking at? --
> Andrei

Well, it totally depends. Right now, in my case, 'transform' is some
image processing code (in the past, when I've had these same
thoughts, it was audio filters). You can't touch pixels (or samples)
one at a time: they need manual SIMD deployment (I've never seen an
auto-vectoriser handle saturation arithmetic or type promotion), the
alpha component (every 4th byte) is treated differently, and memory
access patterns need to be tuned to be cache-friendly.
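
To illustrate the sort of per-pixel code I mean, here's a minimal
scalar sketch in D (the 'brighten' transform is made up for the
example):

  import std.algorithm : map, copy;

  // Hypothetical per-pixel transform: brighten with saturation.
  // a + amount promotes to int (scalar semantics), then clamps to
  // 255 -- the pattern auto-vectorisers tend to give up on, even
  // though SSE has a single instruction for it (paddusb).
  ubyte brighten(ubyte a, ubyte amount = 40)
  {
      int sum = a + amount;                       // promoted to int
      return cast(ubyte)(sum > 255 ? 255 : sum);  // saturate
  }

  void main()
  {
      ubyte[] pixels = [250, 10, 128, 255];
      auto output = new ubyte[pixels.length];
      pixels.map!(x => brighten(x)).copy(output);
  }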

I haven't done benchmarks right now, but I've done them
professionally in the past, and it's not unusual for a hand-written
image processing loop to see one or even two orders of magnitude
improvement over calling a function for each pixel in a loop. The
sorts of low-level optimisations you deploy in image and audio
processing loops are not things I've ever seen any optimiser even
attempt.
Some core problems that tend to require manual intervention in hot
loops are:
  ubyte[16] <-> ushort[8][2] expansion/contraction
  ubyte[16] <-> float[4][4] expansion/contraction
  saturation (as in the sketch above)
  scalar operator results promote to int, but wide-SIMD operations
don't, which means some scalar expressions can't be losslessly
collapsed to SIMD operations, and the compiler will always be
conservative on this matter; if the auto-vectoriser tries at all, you
will see a mountain of extra code to preserve those bits that the
scalar operator semantics would have guaranteed (first sketch after
this list)
  wide-vector multiplication is semantically different from scalar
multiplication, so the optimiser has a lot of trouble vectorising
muls
  assumptions about data alignment
  interleaved data: audio samples are usually [L,R] interleaved and
images often [RGB,A], and different processes are applied across the
separation; you want to unroll and shuffle the data so you have
vectors [LLLL],[RRRR] or [RGBRGBRGBRGB],[AAAA], and I haven't seen an
optimiser go near that (second sketch after this list)
  vector dot-product is always a nuisance
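
To make the promotion point concrete, a tiny scalar example (the 8.8
fixed-point 'scale' is just an illustration):

  void main()
  {
      // Scalar semantics promote x * s to int, so the full 16-bit
      // product exists before the shift. A packed-byte SIMD multiply
      // keeps only the low 8 bits per lane, so a lossless SIMD
      // version must widen ubyte16 -> ushort8[2], multiply, shift,
      // then narrow again -- the mountain of extra code I mentioned.
      ubyte scale(ubyte x, ubyte s)
      {
          return cast(ubyte)((x * s + 128) >> 8);
      }
      assert(scale(200, 128) == 100);  // 200 * 0.5, rounded
  }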
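
And for the interleaving problem, the std.range primitives can give
you a logical deinterleave today, but not the memory layout a SIMD
kernel actually wants:

  import std.range : stride, drop;
  import std.stdio : writeln;

  void main()
  {
      // Interleaved stereo: [L, R, L, R, ...]
      float[] samples = [0.1f, 0.9f, 0.2f, 0.8f, 0.3f, 0.7f];

      // stride gives a per-channel view, but element access still
      // hops through memory; a SIMD loop would instead load
      // [L,R,L,R] vectors and shuffle them into [LLLL]/[RRRR].
      auto left  = samples.stride(2);
      auto right = samples.drop(1).stride(2);

      writeln(left);   // [0.1, 0.2, 0.3]
      writeln(right);  // [0.9, 0.8, 0.7]
  }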

I could go on and on.

The point is: as an end-user, pipeline APIs are great. As a library
author, I want to present the best-performing library I can, which I
think means we need to find a way to conveniently connect these two
currently disconnected worlds.
I've explored this to some extent, but I've never come up with
anything that I like.
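
For the record, one direction I've poked at (the name 'blockMap' and
the block size are mine, purely hypothetical): let the pipeline hand
the kernel whole slices rather than single elements, so the kernel is
free to deploy SIMD internally:

  import std.range : chunks, zip;

  // Hypothetical adapter: run a slice-at-a-time kernel over
  // fixed-size blocks instead of mapping element by element.
  void blockMap(alias kernel, T)(T[] input, T[] output,
                                 size_t blockSize = 16)
  {
      foreach (pair; zip(input.chunks(blockSize), output.chunks(blockSize)))
          kernel(pair[0], pair[1]);
  }

  void main()
  {
      auto src = new ubyte[64];
      auto dst = new ubyte[64];
      foreach (i, ref p; src)
          p = cast(ubyte)i;

      // The kernel receives up to 16 pixels at a time; a real one
      // would use core.simd here instead of this scalar loop.
      blockMap!((a, b) {
          foreach (i, px; a)
              b[i] = cast(ubyte)(px ^ 0xFF);  // e.g. invert
      })(src, dst);
  }

It's not a satisfying answer by itself (alignment, tail handling and
fusing multiple stages are all still open), but it shows the shape of
the connection I'm after.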

