DConf 2013 Day 3 Talk 5: Effective SIMD for modern architectures by Manu Evans

Thu Jun 20 18:52:56 PDT 2013

On 21 June 2013 00:03, bearophile <bearophileHUGS at lycos.com> wrote:

> Manu:
>
>> They must be aligned, and multiples of N elements.
>
> The D GC currently allocates them 16-bytes aligned (but if you slice the
> array you can lose some alignment). On some new CPUs the penalty for
> misalignment is small.
>

Yes, the GC allocates 16-byte-aligned memory, and this is good. It's critical,
actually. But if the data types themselves weren't aligned, then the
allocation's alignment would be lost as soon as they were used in structs.
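
To illustrate the struct point, here is a minimal sketch in C11 rather than D (the thread itself contains no code, so the language choice and every name below, Vec4, Loose4 and the two particle structs, are hypothetical). It compares a vector type that carries its own 16-byte alignment with one that does not, once each is placed after a one-byte field:

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Hypothetical 4-float vector: the type itself demands 16-byte alignment. */
typedef struct { alignas(16) float v[4]; } Vec4;

/* Same payload, but only the natural alignment of float. */
typedef struct { float v[4]; } Loose4;

/* A small leading field pushes the next member off a 16-byte boundary
   unless that member's type requires the alignment itself. */
typedef struct { char tag; Vec4   pos; } AlignedParticle;
typedef struct { char tag; Loose4 pos; } LooseParticle;

/* Checked at compile time: the aligned type keeps its guarantee in a struct. */
static_assert(alignof(Vec4) == 16, "Vec4 must be 16-byte aligned");
static_assert(offsetof(AlignedParticle, pos) % 16 == 0,
              "compiler pads so that pos starts on a 16-byte boundary");

/* Offsets of the vector member inside each struct. */
size_t aligned_field_offset(void) { return offsetof(AlignedParticle, pos); }
size_t loose_field_offset(void)   { return offsetof(LooseParticle, pos); }
```

Even if the allocator hands back a 16-byte-aligned LooseParticle, its pos member starts at whatever offset float alignment allows, so the allocation's alignment does not carry through to the field.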

You'll notice I made a point of focusing on _portable_ SIMD. It's true,
some new chips can deal with misaligned data at virtually no additional
cost, but they lose nothing by aligning their data regardless, and then the
code can run on anything.
I hope that people write libraries that can run well on anything, not just
their architecture of choice. The guidelines I presented, if followed, will
give you good performance on all architectures.
They're not even very inconvenient.

If your point is about auto-vectorisation being much simpler without the
alignment restrictions, this is true. But again, I'm talking about portable
and RELIABLE implementations, that is, the programmer should know that SIMD
was used effectively, and not have to hope the optimiser was able to do a
good job. Make these guidelines second nature, and you'll foster a habit of
writing portable code even if you don't intend to do so personally. Someone
somewhere may want to use your library...

> You often have "n" values, where n is variable. If n is large enough and
> you are using D vector ops, the handling of the head and tail doesn't waste
> too much time. If you have very few values it's much better to use the SIMD
> code.

See my later slides about branch predictability. When you need to handle
stragglers on the head or tail, then you've introduced 2 sources of
unpredictability (and also bloated your code).
If the arrays are very long, this may be okay as you say, but if they're
not it becomes significant.

But there is a new issue that appears: if the output array is not the same
as the input array, then you have a new misalignment, where the bases of
the 2 arrays might not share the same alignment, and you can't do a SIMD
load from one and store to the other without a series of corrective shifts
and merges, which will effectively result in similar code to my unaligned
load demonstration.

So the case where this is reliable is:
 * long data array
 * output array is the same as the input array (overwrites the input?)

I don't consider that reliable, and I don't think special-case awareness
of those criteria is any easier than carefully and deliberately using SIMD
in the first place.
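
A rough sketch of the head/tail bookkeeping discussed above, written in C as a stand-in and assuming 16-byte vectors of 4-byte floats (head_count and share_alignment are hypothetical helper names, not part of any library):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SIMD_BYTES = 16 };  /* assumed vector width in bytes */

/* Number of leading floats that must be handled one at a time before
   p reaches a 16-byte boundary. Assumes p is at least float-aligned. */
size_t head_count(const float *p) {
    uintptr_t addr = (uintptr_t)p;
    assert(addr % sizeof(float) == 0);
    size_t misalign = (size_t)(addr % SIMD_BYTES);
    return misalign == 0 ? 0 : (SIMD_BYTES - misalign) / sizeof(float);
}

/* Aligned loads from src can be paired with aligned stores to dst only
   if both arrays reach a boundary after the same number of scalar steps. */
int share_alignment(const float *src, const float *dst) {
    return head_count(src) == head_count(dst);
}
```

When share_alignment returns 0, no choice of head length aligns both arrays at once, which is the situation that forces the corrective shifts and merges.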

>> Well, each are valid comparisons in different situations. I'm not sure how
>> syntax could clearly select the one you want.
>
> Maybe later we'll look for some syntax sugar for this.
>

I'm definitely curious... but I'm not sure it's necessary.

>>> Are D intrinsics offering instructions to perform prefetching?
>>
>> Well, GCC does at least. If you're worried about performance at this
>> level, you're probably already using GCC :)
>
> I think D SIMD programmers will expect something functionally like
> __builtin_prefetch to be available in D too:
> http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#index-g_t_005f_005fbuiltin_005fprefetch-3396

Yup, I toyed with the idea of adding it to std.simd, but I didn't think it
fit there.
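
For reference, a minimal sketch of how the GCC builtin mentioned above is typically used from C. The function name sum_with_prefetch and the look-ahead distance are assumptions for illustration only; a useful distance depends on the target and has to be measured:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed look-ahead: 16 floats = 64 bytes, a common cache-line size. */
enum { PREFETCH_AHEAD = 16 };

/* Sums n floats, hinting the next block into cache while the current
   block is being processed. Requires GCC or a compatible compiler. */
float sum_with_prefetch(const float *data, size_t n) {
    assert(data != NULL || n == 0);
    float total = 0.0f;
    for (size_t i = 0; i < n; i += PREFETCH_AHEAD) {
        /* Only form the address if it lies inside the array. */
        if (i + PREFETCH_AHEAD < n) {
            /* Second argument 0 = prefetch for read,
               third argument 3 = high temporal locality. */
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0, 3);
        }
        size_t end = (i + PREFETCH_AHEAD < n) ? i + PREFETCH_AHEAD : n;
        for (size_t j = i; j < end; ++j) {
            total += data[j];
        }
    }
    return total;
}
```

The prefetch is only a hint, so the result is the same with or without it; any benefit shows up in timing alone.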