Very simple SIMD programming

Wed Oct 24 15:46:02 PDT 2012

On 25 October 2012 01:00, bearophile <bearophileHUGS at lycos.com> wrote:

> Manu:
>
>
>> The compiler would have to do some serious magic to optimise that;
>> flattening both sides of the if into parallel expressions, and then
>> applying the mask to combine...
>
> I think it's a small amount of magic.
>
> The simple features shown in that paper are fully focused on SIMD
> programming, so they aren't introducing things clearly not efficient.
>
>
>> I'm personally not in favour of SIMD constructs that are anything less
>> than optimal (but I appreciate I'm probably in the minority here).
>
>>> (The simple benchmarks of the paper show a 5-15% performance loss
>>> compared to handwritten SIMD code.)
>
>> Right, as I suspected.
>
> 15% is a very small performance loss, if for the programmer the
> alternative is writing scalar code, that is 2 or 3 times slower :-)
>
> The SIMD programmers that can't stand a 1% loss of performance use the
> intrinsics manually (or write in asm) and they ignore all other things.
>
> A much larger population of system programmers wish to use modern CPUs
> efficiently, but they don't have time (or skill, this means their programs
> are too much often buggy) for assembly-level programming. Currently they
> use smart numerical C++ libraries, use modern Fortran versions, and/or
> write C/C++ scalar code (or Fortran), add "restrict" annotations, and take
> a look at the produced asm hoping the modern compiler back-ends will
> vectorize it. This is not good enough, and it's far from a 15% loss.
>
> This paper shows a third way, making such kind of programming simpler and
> approachable for a wider audience, with a small performance loss compared
> to handwritten code. This is what language designers do since 60+ years :-)
>

I don't disagree with you, it is fairly cool!
I can't imagine D adopting those sorts of language features any time
soon, but it's probably possible.
I guess the keys are defining the bool vector concept, and some tech to
flatten both sides of a vector if statement, but that's far from simple...
Particularly so if someone puts some unrelated code in those if blocks.
Chances are it offers too much freedom that wouldn't be well used or
understood by the average programmer, and that still leaves you in a
similar land of only being particularly worthwhile in the hands of a fairly
advanced/competent user.
The main error that most people make is thinking SIMD code is faster by
nature. Truth is, in the hands of someone who doesn't know precisely what
they're doing, SIMD code is almost always slower.
There are some cool new expressions offered here, fairly convenient
(although easy[er?] to write in other ways too), but I don't think it would
likely change that fundamental premise for the average programmer beyond
some very simple parallel constructs that the compiler can easily get right.
I'd certainly love to see it, but is it realistic that someone would take
the time to do all of that any time soon when the benefits are
controversial? It may even open the possibility for unskilled people
to write far worse code.

Let's consider your example above for instance, I would rewrite (given
existing syntax):

// vector length of context = 1; current_mask = T
int4 v = [0,3,4,1];
int4 w = 3; // [3,3,3,3] via broadcast
uint4 m = maskLess(v, w); // [T,F,F,T] (T == ones, F == zeroes)
v += int4(1); // [1,4,5,2]

// the if block is trivially rewritten:
int4 trueSide = v + int4(2);
int4 falseSide = v + int4(3);
v = select(m, trueSide, falseSide); // [3,7,8,4]
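
For readers who want to trace the lane values, the rewrite above can be emulated in plain scalar C. This is only a sketch under stated assumptions: the `int4` and `uint4` structs and the helpers `splat`, `add4`, `mask_less` and `select4` are hypothetical stand-ins for the vector types and the `maskLess`/`select` functions named in the post, not a real library API, and the loops model what a single SIMD instruction would do per lane.

```c
#include <stdint.h>

/* Hypothetical 4-lane types, emulated with plain arrays (assumption:
   stand-ins for the int4/uint4 vector types used in the post). */
typedef struct { int32_t lane[4]; } int4;
typedef struct { uint32_t lane[4]; } uint4;

/* Broadcast one scalar into all four lanes. */
static int4 splat(int32_t x) {
    int4 r = {{x, x, x, x}};
    return r;
}

/* Lane-wise addition. */
static int4 add4(int4 a, int4 b) {
    int4 r;
    for (int i = 0; i < 4; ++i) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

/* Lane-wise a < b: all ones where true, all zeroes where false. */
static uint4 mask_less(int4 a, int4 b) {
    uint4 m;
    for (int i = 0; i < 4; ++i)
        m.lane[i] = (a.lane[i] < b.lane[i]) ? 0xFFFFFFFFu : 0u;
    return m;
}

/* Lane-wise blend: take the bits of t where the mask is set and the bits
   of f elsewhere. The cast back to int32_t assumes the usual
   two's-complement behaviour of mainstream compilers. */
static int4 select4(uint4 m, int4 t, int4 f) {
    int4 r;
    for (int i = 0; i < 4; ++i) {
        uint32_t bits = ((uint32_t)t.lane[i] & m.lane[i])
                      | ((uint32_t)f.lane[i] & ~m.lane[i]);
        r.lane[i] = (int32_t)bits;
    }
    return r;
}

/* The same steps as the rewritten if block, one statement per line. */
static int4 rewritten_if(void) {
    int4 v = {{0, 3, 4, 1}};
    int4 w = splat(3);                         /* [3,3,3,3] */
    uint4 m = mask_less(v, w);                 /* [T,F,F,T] */
    v = add4(v, splat(1));                     /* [1,4,5,2] */
    int4 true_side = add4(v, splat(2));        /* [3,6,7,4] */
    int4 false_side = add4(v, splat(3));       /* [4,7,8,5] */
    return select4(m, true_side, false_side);  /* [3,7,8,4] */
}
```

Calling `rewritten_if()` yields the lanes [3,7,8,4], matching the comments in the snippet above. Note that the mask is computed before `v` is incremented, as in the original.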

Or the whole thing further simplified:
int4 v = [0,3,4,1];
int4 w = 3; // [3,3,3,3] via broadcast

// one convenient function does the comparison and select accordingly
// this combines the prior few lines
v = selectLess(v, w, v + int4(1 + 2), v + int4(1 + 3));

I actually find this more convenient. I also find the if syntax you
demonstrate to be rather deceptive and possibly misleading. 'if' suggests a
branch, whereas the construct you demonstrate will evaluate both sides
every time. Inexperienced programmers may not really grasp that. Evaluating
the true side and the false side inline, and then performing the select
serially, is more honest; it's actually what the computer will do, and I
don't really see it being particularly less convenient either.
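
The point about both sides being evaluated can be made concrete with a small scalar C sketch. Everything here is an illustrative assumption rather than anything from the paper or from D: `true_side` and `false_side` are hypothetical stand-ins for the two bodies of the if, and the counters exist only to show how often each body runs.

```c
#include <stdint.h>

/* Counters that record how many times each side is evaluated. */
static int true_evals = 0;
static int false_evals = 0;

/* Hypothetical bodies of the two sides of the if. */
static int32_t true_side(int32_t x)  { ++true_evals;  return x + 2; }
static int32_t false_side(int32_t x) { ++false_evals; return x + 3; }

/* A real branch: each element runs exactly one of the two sides. */
static void scalar_branch(const int32_t v[4], const int mask[4],
                          int32_t out[4]) {
    for (int i = 0; i < 4; ++i) {
        if (mask[i]) out[i] = true_side(v[i]);
        else         out[i] = false_side(v[i]);
    }
}

/* What a masked vector 'if' amounts to: both sides run for every lane,
   then the mask picks one result per lane. */
static void masked_select(const int32_t v[4], const int mask[4],
                          int32_t out[4]) {
    int32_t t[4], f[4];
    for (int i = 0; i < 4; ++i) t[i] = true_side(v[i]);
    for (int i = 0; i < 4; ++i) f[i] = false_side(v[i]);
    for (int i = 0; i < 4; ++i) out[i] = mask[i] ? t[i] : f[i];
}
```

With `v = {1, 4, 5, 2}` and `mask = {1, 0, 0, 1}`, both functions produce {3, 7, 8, 4}, but `scalar_branch` performs four side evaluations in total while `masked_select` performs eight. On real SIMD hardware those eight scalar evaluations become two vector operations, which is why the select form usually still wins, but the work on the unused side is never skipped.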