Scientific computing and parallel computing C++23/C++26
Bruce Carneal
bcarneal at gmail.com
Thu Jan 20 12:18:27 UTC 2022
On Thursday, 20 January 2022 at 08:36:32 UTC, Ola Fosheim Grøstad
wrote:
> On Thursday, 20 January 2022 at 08:20:58 UTC, Nicholas Wilson
> wrote:
>> Now you've confused me. You can select which implementation
>> to use at runtime with e.g. CPUID or more sophisticated
>> methods. LDC targeting DCompute can produce multiple objects
>> with the same compiler invocation, i.e. you can get CUDA for
>> any set of SM version, OpenCL compatible SPIR-V which you can
>> get per GPU, inspect its hardware characteristics and then
>> select which of your kernels to run.
>
> Yes, so why do you need compile time features?
Because compilers are not sufficiently advanced to extract all
the performance that is available on their own.
A good example of where the automated/simple approach was not
good enough is CUB (CUDA UnBound), a high-performance CUDA
library found here: https://github.com/NVIDIA/cub/tree/main/cub
I'd recommend taking a look at the specializations CUB carries
out in the name of performance. D's compile-time features can
help reduce this kind of mess, both in extreme-performance
libraries and in extreme-performance application code.
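To make that concrete, here is a minimal sketch (not taken from
CUB; the SM thresholds and tuning numbers are made up) of how D's
static if can select per-architecture tuning parameters at compile
time, the kind of thing CUB does with stacks of C++ template
specializations:

struct Policy(int smVersion)
{
    // Pick tuning constants per GPU architecture at compile time.
    // The thresholds and values below are illustrative only.
    static if (smVersion >= 80)       // e.g. Ampere
    {
        enum blockThreads   = 256;
        enum itemsPerThread = 16;
    }
    else static if (smVersion >= 60)  // e.g. Pascal
    {
        enum blockThreads   = 128;
        enum itemsPerThread = 8;
    }
    else                              // older architectures
    {
        enum blockThreads   = 64;
        enum itemsPerThread = 4;
    }
}

unittest
{
    alias P = Policy!80;
    static assert(P.blockThreads == 256 && P.itemsPerThread == 16);
}

One Policy template replaces the pile of hand-written
per-architecture specializations you'd otherwise maintain.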
>
> My understanding is that the goal of nvc++ is to compile to CPU
> or GPU based on what pays off more for the actual code. So it
> will not need any annotations (it is up to the compiler to
> choose between CPU/GPU?). Bryce suggested that it currently
> only targets one specific GPU, but that it will target multiple
> GPUs for the same executable in the future.
>
> The goal for C++ parallelism is to make it fairly transparent
> to the programmer. Or did I misunderstand what he said?
I think that is an entirely reasonable goal, but such
transparency may cost performance, and any such cost will be
unacceptable to some.
>
> My viewpoint is that if one is going to take a performance hit
> by not writing the shaders manually, one needs to get maximum
> convenience as a payoff.
>
> It should be an alternative for programmers that cannot afford
> to put in the extra time to support GPU compute manually.
Yes. It's always good to have alternatives. Fully automated is one
option, hinted is a second, and meta-programming-assisted manual
is a third.
>
>
>>> If you have to do the unrolling in D, then a lot of the
>>> advantage is lost and I might just as well write in a shader
>>> language...
>>
>> D can be your compute shading language for Vulkan and with a
>> bit of work whatever you'd use HLSL for, it can also be your
>> compute kernel language substituting for OpenCL and CUDA.
>
> I still don't understand why you would need static if/static
> for-loops? Seems to me that this is too hardwired, you'd be
> better off with compiler unrolling hints (C++ has these) if the
> compiler does the wrong thing.
If you can achieve your performance objectives with automated or
hinted solutions, great! But what if you can't? Most people
will not have to go as hardcore as the CUB authors did to get the
performance they need, but I quite often find myself wanting more
than the compiler can easily give me. I'm very happy to have the
meta-programming tools to factor and reduce these "manual"
programming tasks. A sketch of what I mean follows.
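As an illustration (the unroll factor and loop body here are
hypothetical, not from any real kernel), static foreach gives you
guaranteed unrolling rather than a hint the compiler is free to
ignore:

enum unrollFactor = 4;

void axpy(float a, const(float)[] x, float[] y)
{
    size_t i = 0;
    // The static foreach is expanded at compile time into
    // unrollFactor copies of the body; no hint to honor or ignore.
    for (; i + unrollFactor <= x.length; i += unrollFactor)
    {
        static foreach (j; 0 .. unrollFactor)
            y[i + j] += a * x[i + j];
    }
    // Scalar tail for the remaining elements.
    for (; i < x.length; ++i)
        y[i] += a * x[i];
}

Change unrollFactor and the generated code changes with it;
nothing is left to the optimizer's discretion.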
>
>
>> Same caveats apply for Metal (should be pretty easy to do:
>> need Objective-C support in LDC, need Metal bindings).
>
> Use clang to compile the objective-c code to object files and
> link with it?