auto vectorization notes

Bruce Carneal bcarneal at gmail.com
Sat Mar 28 22:22:27 UTC 2020


On Saturday, 28 March 2020 at 18:01:37 UTC, Crayo List wrote:
> On Saturday, 28 March 2020 at 06:56:14 UTC, Bruce Carneal wrote:
>> On Saturday, 28 March 2020 at 05:21:14 UTC, Crayo List wrote:
>>> On Monday, 23 March 2020 at 18:52:16 UTC, Bruce Carneal wrote:
>>>> [snip]
>> Explicit SIMD code, ispc or other, isn't as readable or 
>> composable or vanilla portable but it certainly is performance 
>> predictable.
>
> This is not true! The idea of ispc is to write portable code
> that will vectorize predictably based on the target CPU. The
> object file/binary is not portable, if that is what you meant.
> Also, I find it readable.
>

There are many waypoints on the readability <==> performance 
axis.  If ispc works for you along that axis, great!
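To put something concrete on that axis, here's a minimal D sketch 
(my own toy example, not taken from ispc): the first version leaves 
vectorization to the compiler, the second spells out the lanes with 
core.simd and trades readability and portability for predictability.

import core.simd;

// Auto-vectorizable form: plain scalar code, readable and portable.
// LDC/GDC will typically vectorize this loop at higher -O levels.
void scale(float[] y, const(float)[] x, float a)
{
    foreach (i; 0 .. y.length)
        y[i] = a * x[i];
}

// Explicit SIMD form: performance-predictable, but tied to a 4-wide
// lane width and to targets where float4 exists. Assumes the data
// length is a multiple of 4; vecs is the count of float4 chunks.
void scaleSimd(float4* y, const(float4)* x, float a, size_t vecs)
{
    float4 av = a;              // broadcast the scalar to all 4 lanes
    foreach (i; 0 .. vecs)
        y[i] = av * x[i];
}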

>> I find SIMT code readability better than SIMD but a little 
>> worse than auto-vectorizable kernels.  Hugely better 
>> performance though for less effort than SIMD if your platform 
>> supports it.
>
> Again I don't think this is true. Unless I am misunderstanding
> you, SIMT and SIMD are not mutually exclusive, and if you need
> performance then you must use both. Also, based on the workload
> and processor, SIMD may be much more effective than SIMT.

SIMD might become part of the solution under the hood for a 
number of reasons including: ease of deployment, programmer 
familiarity, PCIe xfer overheads, kernel launch overhead, memory 
subsystem suitability, existing code base issues, ...

SIMT works for me in high-throughput situations where it's hard 
to "take a log" on the problem (i.e., no algorithmic shortcut 
collapses the work, so you just need raw parallel throughput).  
SIMD, in auto-vectorizable or more explicit form, works in others.

Combinations can be useful, but most of the work I've come in 
contact with splits pretty clearly along the memory bandwidth 
divide (SIMT on one side, SIMD/CPU on the other).  Others need a 
plus-up in arithmetic horsepower.  The more extreme the 
requirements, the more attractive SIMT appears (hence my 
excitement about dcompute possibly expanding the dlang 
performance envelope with much less cognitive load than 
CUDA/OpenCL/SYCL/...).
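For a sense of the SIMT style, here's a minimal per-lane kernel 
sketch modeled on dcompute's saxpy example (module and attribute 
names recalled from memory, so treat them as approximate):

@compute(CompileFor.deviceOnly) module kernels;
import ldc.dcompute;
import dcompute.std.index;

// Each GPU thread (lane) runs this body for one index: the code
// reads as ordinary scalar D, but executes massively in parallel.
@kernel void saxpy(GlobalPointer!float res,
                   GlobalPointer!float x,
                   GlobalPointer!float y,
                   float alpha, size_t n)
{
    auto i = GlobalIndex.x;
    if (i >= n) return;          // guard lanes past the end
    res[i] = alpha * x[i] + y[i];
}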

On the readability front, I find per-lane programming, even with 
the current thread-divergence caveats, to be easier to reason 
about wrt correctness and performance predictability than other 
approaches.  Apparently your mileage does vary.
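A small hypothetical kernel (same assumed dcompute-style API as 
above) shows both sides of that: the per-lane branch is easy to 
reason about for correctness, while the divergence cost stays a 
visible, local concern.

@kernel void clamp(GlobalPointer!float a, float lo, float hi, size_t n)
{
    auto i = GlobalIndex.x;
    if (i >= n) return;
    // Per-lane branches: each lane's logic is plain scalar code,
    // but lanes in a warp that take different paths serialize,
    // which is the performance caveat to keep in mind.
    float v = a[i];
    if (v < lo)      a[i] = lo;
    else if (v > hi) a[i] = hi;
}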

When you have chosen SIMD, whether ispc or other, over SIMT, what 
drove the decision?  Performance?  Ease of programming to reach a 
target speed?
