Taking pipeline processing to the next level

Manu via Digitalmars-d digitalmars-d at puremagic.com
Mon Sep 5 20:08:43 PDT 2016


On 5 September 2016 at 23:38, Andrei Alexandrescu via Digitalmars-d
<digitalmars-d at puremagic.com> wrote:
> On 9/5/16 1:43 PM, Ethan Watson wrote:
>>
>> On Monday, 5 September 2016 at 08:21:53 UTC, Andrei Alexandrescu wrote:
>>>
>>> What are the benchmarks and the numbers? What loss are you looking at?
>>> -- Andrei
>>
>>
>> Just looking at the example, and referencing the map code in
>> std.algorithm.iteration, I can see multiple function calls instead of
>> one, since every indexing of the new map performs the transformation
>> again instead of caching the result. I'm not sure whether the lambda
>> declaration there will result in the argument being taken by ref or
>> by value, but let's assume by value for the sake of argument.
>> Depending on whether that value is a reference or a value type, it
>> could be either a cheap function call or an expensive one.
>>
>> But even if it took it by reference, it's still a function call.
>> Function calls are generally The Devil(TM) in a gaming environment.
>> The fewer you make, the better.
>>
>> Random aside: There are streaming store instructions available to me on
>> x86 platforms so that I don't have to wait for the destination to hit L1
>> cache before writing. The pattern Manu talks about with a batching
>> function can better exploit this. But I imagine copy could also take
>> advantage of this when dealing with value types.
>
>
> Understood. Would explicitly asking for vectorized operations be acceptable?
> One school of thought has it that explicit invocation of parallel operations
> is preferable to autovectorization and its ilk. -- Andrei

I still stand by this, and I listed some reasons above.
Auto-vectorisation is a nice opportunistic optimisation, but it can't
be relied on. The key reason is that scalar arithmetic semantics are
different from vector semantics, so the auto-vectoriser has to emit a
whole bunch of extra junk code to carefully (and usually pointlessly)
preserve the scalar semantics of the loop it's trying to vectorise.
This will never end well.
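To illustrate what I mean by explicit invocation, here's a minimal
sketch assuming DMD's core.simd (loadUnaligned/storeUnaligned live in
druntime there; LDC spells the equivalents differently in ldc.simd,
so take the exact names as illustrative):

import core.simd;

// Scale an array of floats by s, four lanes at a time.
// The loop is written in vector semantics from the start, so there
// are no scalar semantics for the compiler to carefully preserve.
void scale(float[] data, float s)
{
    float4 vs = s; // broadcast the scalar to all four lanes
    size_t i = 0;
    for (; i + 4 <= data.length; i += 4)
    {
        float4 v = loadUnaligned(cast(float4*)(data.ptr + i));
        // Ethan's streaming-store aside would slot in here: a
        // non-temporal store (compiler-specific intrinsic, e.g. LDC's
        // MOVNT builtins) could replace storeUnaligned when the
        // destination won't be read again soon.
        storeUnaligned(cast(float4*)(data.ptr + i), v * vs);
    }
    for (; i < data.length; ++i) // scalar tail for the leftovers
        data[i] *= s;
}

There's no guessing about what the optimiser will or won't do; the
vector operations are exactly the ones written.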
But the vectorisation isn't really the interesting problem here. I'm
just interested in how to work these batch-processing functions into
our nice modern pipeline statements without placing an unreasonable
burden on the end-user, who shouldn't be expected to go out of their
way. If they even have to start manually chunking, I think we've
already lost; they won't know the optimal chunk sizes, or anything
about alignment boundaries, cache behaviour, etc.
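To make the shape of it concrete, here's a rough sketch of the kind
of adapter I have in mind. Everything in it is illustrative, not a
real Phobos API: the name batchMap, the fixed chunk size, and the
in-place kernel signature; a real version would pick chunk sizes and
handle alignment per the discussion above.

import std.range.primitives;

// Sketch: gather elements into fixed-size batches and hand each
// batch to `kernel` in a single call, while still presenting an
// ordinary element-wise input range to the rest of the pipeline.
auto batchMap(alias kernel, size_t batchSize = 1024, R)(R r)
    if (isInputRange!R)
{
    alias T = ElementType!R;

    static struct BatchMap
    {
        R source;
        T[batchSize] buf;
        size_t len, pos;

        void refill()
        {
            len = pos = 0;
            while (len < batchSize && !source.empty)
            {
                buf[len++] = source.front;
                source.popFront();
            }
            if (len)
                kernel(buf[0 .. len]); // one call per chunk, not per element
        }

        @property bool empty() const { return pos == len; }
        @property T front() { return buf[pos]; }
        void popFront() { if (++pos == len) refill(); }
    }

    auto result = BatchMap(r);
    result.refill();
    return result;
}

Usage stays an ordinary pipeline, and the chunking remains an
implementation detail the user never sees:

float[] data = [1, 2, 3, 4, 5];
auto doubled = data.batchMap!((float[] c) { c[] *= 2.0f; });

This sketch only handles kernels that transform a batch in place (so
input and output element types match); a real design would also want
type-changing kernels and a smarter chunk-size heuristic.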

