Scientific computing and parallel computing C++23/C++26
iamthewilsonator at hotmail.com
Thu Jan 20 08:20:58 UTC 2022
On Thursday, 20 January 2022 at 06:57:28 UTC, Ola Fosheim Grøstad wrote:
> On Thursday, 20 January 2022 at 00:43:30 UTC, Nicholas Wilson wrote:
>> I mean there are parametric attributes of the hardware, say
>> for example cache size (or available registers for GPUs), that
>> have a direct effect on how many times you can unroll the
>> inner loop, say for a windowing function, and you want to ship
>> optimised code for multiple configurations of hardware.
>> You can much more easily create multiple copies for different
>> sized cache (or register availability) in D than you can in
>> C++, because static foreach and static if >>> if constexpr.
> Hmm, I don't understand, the unrolling should happen at runtime
> so that you can target all GPUs with one executable?
Now you've confused me. You can select which implementation to
use at runtime with e.g. CPUID or more sophisticated methods. LDC
targeting DCompute can produce multiple objects with the same
compiler invocation, i.e. you can get CUDA for any set of SM
versions and OpenCL-compatible SPIR-V. Then, per GPU, you can
inspect its hardware characteristics and select which of your
kernels to run.
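A minimal sketch of the idea in D (names like `sumUnrolled` and
`queryCacheBytes` are illustrative placeholders, not DCompute API):
`static foreach` stamps out one inner-loop body per unroll factor at
compile time, all variants ship in the one binary, and a runtime
hardware query picks which one to call.

```d
import std.stdio;

// Compile-time variant generation: `unroll` is a template parameter,
// so each instantiation gets a differently-unrolled inner loop.
float sumUnrolled(size_t unroll)(const(float)[] data)
{
    float total = 0;
    size_t i = 0;
    for (; i + unroll <= data.length; i += unroll)
    {
        // static foreach expands this body `unroll` times at compile time.
        static foreach (j; 0 .. unroll)
            total += data[i + j];
    }
    for (; i < data.length; ++i) // remainder loop
        total += data[i];
    return total;
}

// Stand-in for a real hardware query (CPUID, GPU runtime introspection).
size_t queryCacheBytes() { return 512 * 1024; }

void main()
{
    alias Kernel = float function(const(float)[]);
    // Runtime dispatch: select a variant once at startup; the hot path
    // then calls through `kernel` with no further branching.
    Kernel kernel = queryCacheBytes() >= 256 * 1024
        ? &sumUnrolled!8 : &sumUnrolled!2;
    const(float)[] xs = [1, 2, 3, 4, 5];
    writeln(kernel(xs)); // 15
}
```

The same pattern scales to generating a whole family of kernels with
`static foreach` over a list of factors, which is the part that is far
more awkward to express with C++'s `if constexpr`.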
> If you have to do the unrolling in D, then a lot of the
> advantage is lost and I might just as well write in a shader
D can be your compute shading language for Vulkan and, with a bit
of work, whatever you'd use HLSL for; it can also be your compute
kernel language, substituting for OpenCL and CUDA. The same
caveats apply for Metal (which should be pretty easy to do: it
needs Objective-C support in LDC and Metal bindings).
More information about the Digitalmars-d mailing list