Scientific computing and parallel computing C++23/C++26
iamthewilsonator at hotmail.com
Thu Jan 20 08:20:58 UTC 2022
On Thursday, 20 January 2022 at 06:57:28 UTC, Ola Fosheim Grøstad wrote:
> On Thursday, 20 January 2022 at 00:43:30 UTC, Nicholas Wilson wrote:
>> I mean there are parametric attributes of the hardware, say
>> for example cache size (or available registers for GPUs), that
>> have a direct effect on how many times you can unroll the
>> inner loop, say for a windowing function, and you want to ship
>> optimised code for multiple configurations of hardware.
>> You can much more easily create multiple copies for different
>> sized cache (or register availability) in D than you can in
>> C++, because static foreach and static if >>> if constexpr.
> Hmm, I don't understand, the unrolling should happen at runtime
> so that you can target all GPUs with one executable?
Now you've confused me. You can select which implementation to
use at runtime with e.g. CPUID or more sophisticated methods. LDC
targeting DCompute can produce multiple objects with the same
compiler invocation, i.e. you can get CUDA for any set of SM
versions and OpenCL-compatible SPIR-V. Then, per GPU, you can
inspect its hardware characteristics and select which of your
kernels to run.
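A minimal sketch of the idea in D (names like `sumUnrolled` and
`queryCacheBytes` are illustrative placeholders, not DCompute API):
`static foreach` stamps out one inner-loop body per unroll factor at
compile time, all variants ship in the one binary, and a runtime
hardware query picks which one to call.

```d
import std.stdio;

// Compile-time variant generation: `unroll` is a template parameter,
// so each instantiation gets a differently-unrolled inner loop.
float sumUnrolled(size_t unroll)(const(float)[] data)
{
    float total = 0;
    size_t i = 0;
    for (; i + unroll <= data.length; i += unroll)
    {
        // static foreach expands this body `unroll` times at compile time.
        static foreach (j; 0 .. unroll)
            total += data[i + j];
    }
    for (; i < data.length; ++i) // remainder loop
        total += data[i];
    return total;
}

// Stand-in for a real hardware query (CPUID, GPU runtime introspection).
size_t queryCacheBytes() { return 512 * 1024; }

void main()
{
    alias Kernel = float function(const(float)[]);
    // Runtime dispatch: select a variant once at startup; the hot path
    // then calls through `kernel` with no further branching.
    Kernel kernel = queryCacheBytes() >= 256 * 1024
        ? &sumUnrolled!8 : &sumUnrolled!2;
    const(float)[] xs = [1, 2, 3, 4, 5];
    writeln(kernel(xs)); // 15
}
```

The same pattern scales to generating a whole family of kernels with
`static foreach` over a list of factors, which is the part that is far
more awkward to express with C++'s `if constexpr`.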
> If you have to do the unrolling in D, then a lot of the
> advantage is lost and I might just as well write in a shader
D can be your compute shading language for Vulkan and, with a bit
of work, whatever you'd use HLSL for; it can also be your compute
kernel language, substituting for OpenCL and CUDA. The same
caveats apply for Metal (which should be pretty easy to do: it
needs Objective-C support in LDC and Metal bindings).
More information about the Digitalmars-d mailing list