Scientific computing and parallel computing C++23/C++26
Ola Fosheim Grøstad
ola.fosheim.grostad at gmail.com
Thu Jan 20 13:29:26 UTC 2022
On Thursday, 20 January 2022 at 12:18:27 UTC, Bruce Carneal wrote:
> Because compilers are not sufficiently advanced to extract all
> the performance that is available on their own.
Well, but D developers cannot test on all available CPU/GPU
combinations either, so you don't know whether SIMD would perform
better than the GPU.
Something automated has to be present, at least at install time;
otherwise you risk a performance regression compared to a pure
SIMD implementation, in which case it is better (and cheaper) to
avoid the GPU altogether.
> A good example of where the automated/simple approach was not
> good enough is CUB (CUDA unbound), a high performance CUDA
> library found here https://github.com/NVIDIA/cub/tree/main/cub
> I'd recommend taking a look at the specializations that occur
> in CUB in the name of performance.
I am sure you are right, but I didn't find anything special when
I browsed through the repo.
> If you can achieve your performance objectives with automated
> or hinted solutions, great! But what if you can't?
Well, my gut instinct is that if you want maximal performance for
a specific GPU then you would be better off using
But I have no experience with that, as it is quite time-consuming
to go that route. Right now, basic SIMD is time-consuming enough…