Taking pipeline processing to the next level

H. S. Teoh via Digitalmars-d digitalmars-d at puremagic.com
Sun Sep 4 23:21:42 PDT 2016


On Mon, Sep 05, 2016 at 03:08:53PM +1000, Manu via Digitalmars-d wrote:
> I mostly code like this now:
>   data.map!(x => transform(x)).copy(output);
> 
> It's convenient and reads nicely, but it's generally inefficient.
> This sort of one-by-one software design is the core performance
> problem with OOP. It seems a shame to be suffering OOP's failures even
> when there is no OOP in sight.

In functional languages, where these ideas originated, it's mostly the
compiler's job to make things efficient.  I'd argue that we should
rather be looking at how to improve the compiler.  As far as I know,
currently even D compilers don't really take full advantage of the
range idiom; i.e., their optimizers are written for the general case of
arbitrary code, so while you do get some benefits, it may not
necessarily translate to overall efficient code.  Dmd is particularly
bad at optimizing range-based code, mainly because its inliner gives up
too easily.  Nest the range API calls a level too deep, and the inliner
fails to inline it, and as a result a whole series of subsequent
optimizations are missed.

IME, gdc does a bit better -- it's able to translate simple range
compositions into vectorized operations in some cases that I've seen.
Yes, I realize that there's some debate about whether auto-vectorization
is a good thing or not, but the point is that gdc's optimizer is well
able to translate what you call a one-by-one design into something
that's closer to block-by-block. So it's not out of the realm of
possibility that the compiler could do such things for you.
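For illustration, here's a composition of the shape in question (the function name is hypothetical); a single logical pass over an array with no intermediate allocation is exactly the kind of pipeline gdc's optimizer has a chance of vectorizing:

```d
import std.algorithm : map, sum;

// One logical pass over the array, no intermediate allocation.
// gdc at higher optimization levels can sometimes compile
// pipelines of this shape down to a vectorized loop.
int sumOfSquares(const(int)[] data)
{
    return data.map!(x => x * x).sum;
}

void main()
{
    assert(sumOfSquares([1, 2, 3, 4]) == 30); // 1 + 4 + 9 + 16
}
```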

Now imagine if the D front end were able to recognize certain code
patterns characteristic of range-based code and optimize accordingly. In the
front end, you have access to the AST and to higher-level constructs in
the language, so it's more likely to be able to reliably detect
range-based code, and do something useful with it.  Conceptually, such
an optimizer could see code like this:

	data.map!(x => 2*x + 1)
	    .map!(x => f(x) - 1)
	    .copy(output);

and produce, as an initial step, something like:

	foreach (i, x; data) {
		output[i] = f(2*x + 1) - 1;
	}

Then this gets passed through the backend optimizer which does loop
optimizations like strength reduction, unrolling, etc. The unrolled
loop could then get passed over another optimizer iteration to produce
vectorized code, or the equivalent transformation into block-based
operations.
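The equivalence of the pipeline and its hand-fused form can be checked directly; f here is just a stand-in for an arbitrary transform:

```d
import std.algorithm : copy, map;

int f(int x) { return x * x; } // hypothetical stand-in transform

void main()
{
    auto data = [1, 2, 3, 4];
    auto pipelined = new int[data.length];
    auto fused = new int[data.length];

    // The range pipeline as written:
    data.map!(x => 2*x + 1)
        .map!(x => f(x) - 1)
        .copy(pipelined);

    // The manually fused loop an optimizer could produce:
    foreach (i, x; data)
        fused[i] = f(2*x + 1) - 1;

    assert(pipelined == fused); // both are [8, 24, 48, 80]
}
```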

There's a lot of optimization potential with range-based code that
currently isn't really exploited by existing D compilers. It would be
interesting to investigate what optimizations are possible and to
implement them in the D front-end.


> A central premise of performance-oriented programming which I've
> employed my entire career, is "where there is one, there is probably
> many", and if you do something to one, you should do it to many.  With
> this in mind, the code I have always written doesn't tend to look like
> this:
>   R manipulate(Thing thing);
> 
> Instead:
>   void manipulateThings(Thing *things, size_t numThings, R *output,
> size_t outputLen);
> 
> Written this way for clarity. Obviously, the D equiv uses slices.

Actually, isn't this the whole premise of range-based code? Instead of
dealing with individual items, you work with a range of items, and not
necessarily arrays either. Though for obvious reasons arrays would tend
to perform better than the general range.


> All functions are implemented with the presumption they will operate
> on many things, rather than being called many times for each one.
> This is the single most successful design pattern I have ever
> encountered wrt high-performance code; ie, implement the array version
> first.

In D, that would be "implement the range version first" (and then you
realize you don't need to implement any other version afterwards,
because the range version already covers everything else :-P).
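A minimal sketch of that pattern (manipulate and its transform are hypothetical): write the function once against any input range, and arrays, lazy generators, and other pipelines all go through the same entry point:

```d
import std.algorithm : map;
import std.algorithm.comparison : equal;
import std.range : isInputRange, iota;

// Range version first: accepts anything satisfying isInputRange,
// so there's no separate array version to write.
auto manipulate(R)(R things) if (isInputRange!R)
{
    return things.map!(x => x + 1); // placeholder per-element transform
}

void main()
{
    assert(manipulate([1, 2, 3]).equal([2, 3, 4])); // array input
    assert(manipulate(iota(3)).equal([1, 2, 3]));   // lazy range input
}
```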


> The problem with this API design, is that it doesn't plug into
> algorithms or generic code well.
>   data.map!(x => transformThings(&x, 1)).copy(output);
> 
> I often wonder how we can integrate this design principle conveniently
> (ie, seamlessly) into the design of algorithms, such that they can
> make use of batching functions internally, and transparently?
> 
> Has anyone done any work in this area?
> 
> Ideas?

One thought that occurs to me is to focus on optimizing code involving
.map.  Obviously, a naive translation of a UFCS chain involving .map
would involve calling the transformation function one item at a time.
But it doesn't have to be that way, especially if the function is a
lambda or inlineable.  Conceivably, the optimizer could first inline
the function, then unroll the loop, and then recognize that further
optimizations apply across multiple loop iterations, turning the result
into batched code.

Of course, compiler improvements are kinda more in the long term; in the
short term, what you might be able to do is to explicitly work with the
range in blocks, i.e.:

	import std.range : chunks;
	data.chunks(chunkSize)	// split into batches
		.map!(chunk => transformThings(chunk))
		.joiner	// join transformed batches back into flat range
		.copy(output);
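A self-contained version of that sketch, with a hypothetical slice-based transformThings standing in for the D equivalent of the pointer-and-length signature quoted above:

```d
import std.algorithm : copy, joiner, map;
import std.range : chunks;

// Hypothetical slice-based batch transform: the D equivalent of
// Manu's transformThings, processing a whole block at once.
int[] transformThings(int[] things)
{
    auto result = new int[things.length];
    foreach (i, x; things)
        result[i] = 2 * x + 1;
    return result;
}

void main()
{
    auto data = [1, 2, 3, 4, 5];
    auto output = new int[data.length];

    enum chunkSize = 2;
    data.chunks(chunkSize)                      // split into batches
        .map!(chunk => transformThings(chunk))  // batch transform
        .joiner                                 // flatten back into one range
        .copy(output);

    assert(output == [3, 5, 7, 9, 11]);
}
```

Note that the last chunk may be shorter than chunkSize, so the batch function has to handle partial blocks.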


T

-- 
He who is everywhere is nowhere.
