Scientific computing and parallel computing C++23/C++26

Bruce Carneal bcarneal at gmail.com
Thu Jan 20 17:43:22 UTC 2022


On Thursday, 20 January 2022 at 13:29:26 UTC, Ola Fosheim Grøstad 
wrote:
> On Thursday, 20 January 2022 at 12:18:27 UTC, Bruce Carneal 
> wrote:
>> Because compilers are not sufficiently advanced to extract all 
>> the performance that is available on their own.
>
> Well, but D developers cannot test on all available CPU/GPU 
> combinations either, so you don't know whether SIMD would 
> perform better than the GPU.

It can be very expensive to write and test all the permutations, 
yes, but often you'll understand the bottlenecks of your 
algorithms well enough to filter out most of that work up front.  
Restating here, a few of the traditional questions to ask: 
Throughput limited or latency limited?  Operand/memory limited or 
arithmetic limited?  Is power (performance per watt) the figure 
of merit, or raw performance?
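
To make the operand/memory vs. arithmetic question concrete, here 
is a back-of-envelope sketch (mine, with made-up numbers, not 
from any particular project): compare a kernel's flops-per-byte 
against an assumed machine balance.

    // Hedged sketch: classify a kernel as memory- or
    // arithmetic-limited by comparing its arithmetic intensity
    // (flops per byte moved) against an assumed machine balance.
    // All numbers below are illustrative, not measured.
    import std.stdio : writeln;

    double arithmeticIntensity(double flops, double bytesMoved)
    {
        return flops / bytesMoved;
    }

    void main()
    {
        // e.g. a SAXPY-like kernel: 2 flops per 12 bytes moved
        immutable ai = arithmeticIntensity(2, 12);
        immutable machineBalance = 10.0; // assumed flops per byte
        writeln(ai < machineBalance ? "memory limited"
                                    : "arithmetic limited");
    }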

It's possible, for instance, that you can *know*, from first 
principles, that you'll never meet objective X if forced to use 
platform Y.  In general, though, you'll just have a sense of the 
order in which things should be evaluated.

>
> Something automated has to be present, at least on install, 
> otherwise you risk performance degradation compared to a pure 
> SIMD implementation. And then it is better (and cheaper) to 
> just avoid GPU altogether.

Yes, SIMD can be the better performance choice sometimes.  I 
think many people will choose to do a SIMD implementation as a 
baseline for performance, correctness testing, and portability 
regardless of the accelerator possibilities.

>
>> A good example of where the automated/simple approach was not 
>> good enough is CUB (CUDA unbound), a high performance CUDA 
>> library found here https://github.com/NVIDIA/cub/tree/main/cub
>>
>> I'd recommend taking a look at the specializations that occur 
>> in CUB in the name of performance.
>
> I am sure you are right, but I didn't find anything special 
> when I browsed through the repo?

The key thing to note is how much effort the authors put into 
specialization wrt the HW x SW cross product. There are entire 
subdirectories devoted to specialization.

At least some of this complexity, this programming burden, can be 
factored out with better language support.
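
As a rough illustration of what I mean by factoring the 
specialization out (my own sketch, not CUB's approach; the Target 
tags and reduceSum name are hypothetical): put the target choice 
behind one compile-time dispatch point so callers never see it.

    enum Target { scalar, wide }  // hypothetical capability tags

    // One generic entry point; the target-specific body is chosen
    // with static if, so the specialization stays factored out of
    // call sites.
    double reduceSum(Target tgt)(const(double)[] data)
    {
        static if (tgt == Target.wide)
        {
            // a SIMD- or accelerator-specialized kernel would go
            // here; this sketch just falls through to the
            // portable loop below
        }
        double result = 0;
        foreach (x; data)
            result += x;
        return result;
    }

    // usage:  auto total = reduceSum!(Target.wide)(samples);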

>
>> If you can achieve your performance objectives with automated 
>> or hinted solutions, great!  But what if you can't?
>
> Well, my gut instinct is that if you want maximal performance 
> for a specific GPU then you would be better off using 
> Metal/Vulkan/etc directly?

That's what seems reasonable, yes, but fortunately I don't think 
it's correct.  By analogy, you *can* get maximum performance from 
assembly-level programming if you have all the compiler back-end 
knowledge in your head, but if your language lets you communicate 
all the relevant information (mainly dependencies and operand 
localities, but also "intrinsics") then the compiler can do at 
least as well as the assembly-level programmer.  Add language 
support for inline and factored specialization and the 
lower-level alternatives become even less attractive.

>
> But I have no experience with that as it is quite time 
> consuming to go that route. Right now basic SIMD is time 
> consuming enough… (but OK)

Indeed.  I'm currently working on the SIMD variant of something 
I partially prototyped earlier on a 2080, and it has been slow 
going compared to either that GPU implementation or the 
scalar/serial variant.

There are some very nice assists from D for SIMD programming: 
the __vector typing, __vector arithmetic, unaligned vector 
loads/stores via static array operations, and static foreach to 
enable portable expression of single-instruction SIMD functions 
like min, max, select, various shuffles, masks, ...  But, yes, 
SIMD programming is definitely a slog compared to either scalar 
or SIMT GPU programming.
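
For anyone following along, a tiny sketch of a couple of those 
assists, assuming an x86 target where core.simd defines float4:

    import core.simd;

    // per-lane multiply-add: __vector arithmetic just works
    float4 madd(float4 a, float4 x, float4 y)
    {
        return a * x + y;
    }

    // per-lane select, expressed portably with static foreach
    float4 select4(float4 mask, float4 a, float4 b)
    {
        float4 r;
        static foreach (i; 0 .. 4)
            r.array[i] = mask.array[i] != 0 ? a.array[i]
                                            : b.array[i];
        return r;
    }

    // unaligned load through the static-array view
    float4 load4(const(float)* p)
    {
        float4 v;
        static foreach (i; 0 .. 4)
            v.array[i] = p[i];
        return v;
    }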



