From bcarneal at gmail.com Tue Apr 6 02:23:59 2021
From: bcarneal at gmail.com (Bruce Carneal)
Date: Tue, 06 Apr 2021 02:23:59 +0000
Subject: ldc/dcompute atomics for nvptx?
Message-ID:

I'd like to use atomic (rmw) operations from within ldc while targeting
nvptx (via dcompute).

The first place to check is dcompute.std.atomic. That's a nice placeholder,
but only a placeholder, so I started poking around in ldc and clang. After
a modest amount of poking I'm still not sure how to proceed.

If you know of a simple way to bring atomics online for dcompute/nvptx,
I'd like to hear from you. Alternatively, if you know why nvptx atomics
will be hard to bring online, I'd also like to hear from you.

On a positive note, I've had some success in using dcompute/D's meta
programming facilities reworking areal/stencil compute kernels to operate
out of "arrays of registers". You meta-unroll til you wrap around the
stencil, avoiding moves, and you can use intra-warp shuffles to/from
lateral neighbors to minimize load on the memory subsystem when rolling
on to the next row.

Another D advantage over CUDA/C++ that can be exploited is nested
functions. You can declare variables at the outer function level where
they'll pretty much all be mapped to registers (you've got at least 64
per SIMT "lane" to work with, and it's easy to check for spills). You can
then access those enregistered variables directly from within the nested
functions. Sometimes it's nice not having to pass everything through an
argument list.

Thanks again to the ldc/dcompute team for providing the tooling that makes
the above possible. And thanks in advance for any guidance on getting
atomics up for nvptx.

From iamthewilsonator at hotmail.com Sun Apr 25 00:13:43 2021
From: iamthewilsonator at hotmail.com (Nicholas Wilson)
Date: Sun, 25 Apr 2021 00:13:43 +0000
Subject: ldc/dcompute atomics for nvptx?
In-Reply-To:
References:
Message-ID:

On Tuesday, 6 April 2021 at 02:23:59 UTC, Bruce Carneal wrote:
> I'd like to use atomic (rmw) operations from within ldc while
> targeting nvptx (via dcompute).
>
> [...]

Sorry for the late reply.

These should all be doable with pragma(LDC_intrinsic,
"llvm.nvvm.atomic.*") where * is any of "add", "load.add", etc. (I'll try
to get a full list), but there is no real difference between this and,
say, std.cuda.index.

I never implemented them because I didn't need them and my card didn't
support them.

From j at j.nl Sun Apr 25 22:26:06 2021
From: j at j.nl (Johan Engelen)
Date: Sun, 25 Apr 2021 22:26:06 +0000
Subject: ldc nvvm GPU intrinsics good news
In-Reply-To:
References:
Message-ID:

On Friday, 5 March 2021 at 00:03:26 UTC, Bruce Carneal wrote:
> After updating the first line to
> '@compute(CompileFor.hostAndDevice) module ...' and adding an
> 'import ldc.dcompute;' line, the
> runtime/import/ldc/gccbuiltins_nvvm.di file from a current LDC
> build apparently gives access to all manner of GPU intrinsics.

Hi Bruce,

Why not submit a PR that modifies `gen_gccbuiltins.cpp` such that it adds
the `@compute` attribute for the relevant intrinsics files? I think it's
OK if `gen_gccbuiltins` contains some hacks like that. Please add a small
compile test case, so we verify that it won't bitrot in the future.

Wouldn't `@compute(CompileFor.deviceOnly)` make more sense, because the
intrinsics will not be available on normal CPUs anyway?

I hope all your work will land in either LDC or dcompute's repositories,
such that others can easily benefit from it.
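(As a concrete, untested illustration of the pragma(LDC_intrinsic) approach suggested above, a declaration might look like the sketch below. The mangled intrinsic name, the function names, and the kernel are all assumptions for illustration; the exact intrinsic string should be checked against the IntrinsicsNVVM.td of the LLVM version LDC was built with.)

```d
// Sketch only: declaring and using an NVVM atomic intrinsic from D.
// The intrinsic name and its ".f32.p0f32" type-suffix mangling are
// assumptions; verify against llvm/include/llvm/IR/IntrinsicsNVVM.td.
@compute(CompileFor.deviceOnly) module atomics_sketch;
import ldc.dcompute;

pragma(LDC_intrinsic, "llvm.nvvm.atomic.load.add.f32.p0f32")
float nvvm_atomic_add_f32(float* ptr, float val);

// Hypothetical kernel: each thread atomically adds its partial sum into
// a single global accumulator. Address-space handling is glossed over
// here; a real version would respect dcompute's pointer wrappers.
@kernel void accumulate(GlobalPointer!float total, float partial)
{
    nvvm_atomic_add_f32(cast(float*)total, partial);
}
```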
cheers,
  Johan

From bcarneal at gmail.com Mon Apr 26 13:20:11 2021
From: bcarneal at gmail.com (Bruce Carneal)
Date: Mon, 26 Apr 2021 13:20:11 +0000
Subject: ldc nvvm GPU intrinsics good news
In-Reply-To:
References:
Message-ID:

On Sunday, 25 April 2021 at 22:26:06 UTC, Johan Engelen wrote:
> On Friday, 5 March 2021 at 00:03:26 UTC, Bruce Carneal wrote:
>> After updating the first line to
>> '@compute(CompileFor.hostAndDevice) module ...' and adding an
>> 'import ldc.dcompute;' line, the
>> runtime/import/ldc/gccbuiltins_nvvm.di file from a current LDC
>> build apparently gives access to all manner of GPU intrinsics.
>
> Hi Bruce,
> Why not submit a PR that modifies `gen_gccbuiltins.cpp` such
> that it adds the `@compute` attribute for the relevant
> intrinsics files?
> I think it's OK if `gen_gccbuiltins` contains some hacks like
> that. Please add a small compile test case, so we verify that
> it won't bitrot in the future.
>
> Wouldn't `@compute(CompileFor.deviceOnly)` make more sense,
> because the intrinsics will not be available on normal CPUs
> anyway?
>
> I hope all your work will land in either LDC or dcompute's
> repositories, such that others can easily benefit from it.
>
> cheers,
>   Johan

Yes, I'll help when the current push is over here, but I think I don't
understand enough quite yet. I'm still bumping into
limitations/awkwardness in dcompute that should admit simple solutions.
At least it feels that way.

One idea from my experience to date is that we can, and probably should,
create a simpler (from a programmer's perspective) and finer-granularity
way to handle multiple targets. Intrinsic selection is part of that, as
is library selection.

Also on my mind is how we should handle deployment. For the ultimate in
speed we can do AOT per-target specialized compiles and "fat" binaries,
but using SPIR-V + Vulkan compute could significantly improve penetration
and reduce bloat.
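(For reference, the module-header recipe quoted above, with the deviceOnly suggestion applied, might look like the following sketch. The imported module path and the builtin's D-side name are illustrative guesses and should be checked against the generated gccbuiltins_nvvm.di.)

```d
// Sketch of the quoted recipe: mark the module as device code so the
// generated NVVM intrinsics become usable. The builtin name below is a
// guess; check runtime/import/ldc/gccbuiltins_nvvm.di for the real one.
@compute(CompileFor.deviceOnly) module intrinsics_sketch;
import ldc.dcompute;
import ldc.gccbuiltins_nvvm;

@kernel void scale(GlobalPointer!float buf, float factor)
{
    // Hypothetical thread-index intrinsic exposed by gccbuiltins_nvvm.di.
    auto i = __nvvm_read_ptx_sreg_tid_x();
    buf[i] *= factor;
}
```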
I read a relatively recent thread in an LLVM forum indicating that the
Intel guys are pushing a "real" SPIR-V IR effort now, so maybe we can
help out there. Also, I don't know how MLIR should fit into our plans.

I'll be in touch when I get my head above water here.

Thanks to you and the rest of the LDC crew for the help so far. Looking
forward to advancing dlang on GPUs in the future. It really can be much,
much better than C++ in that arena.

Bruce

From johan_forsberg_86 at hotmail.com Tue Apr 27 07:18:18 2021
From: johan_forsberg_86 at hotmail.com (Imperatorn)
Date: Tue, 27 Apr 2021 07:18:18 +0000
Subject: ldc nvvm GPU intrinsics good news
In-Reply-To:
References:
Message-ID:

On Monday, 26 April 2021 at 13:20:11 UTC, Bruce Carneal wrote:
> On Sunday, 25 April 2021 at 22:26:06 UTC, Johan Engelen wrote:
>> [...]
>
> Yes, I'll help when the current push is over here, but I think
> I don't understand enough quite yet. I'm still bumping into
> limitations/awkwardness in dcompute that should admit simple
> solutions. At least it feels that way.
>
> [...]

Nice work, thanks for wanting to improve dcompute! I think D has real
potential there.
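(The nested-function technique described in the opening message of this thread can be sketched roughly as below. The kernel, its body, and all names are illustrative only, not taken from the thread.)

```d
// Sketch of "enregistered" outer locals accessed from nested functions.
// With statically sized locals and full unrolling, the backend will
// typically keep r0..r2 in registers (at least 64 per SIMT lane).
@compute(CompileFor.deviceOnly) module stencil_sketch;
import ldc.dcompute;

@kernel void rowStencil(GlobalPointer!float src, GlobalPointer!float dst)
{
    // Outer-scope locals: the "array of registers".
    float r0 = src[0], r1 = src[1], r2 = src[2];

    // Nested functions read and write the enclosing locals directly,
    // so nothing has to be threaded through an argument list.
    void shiftUp(float incoming) { r0 = r1; r1 = r2; r2 = incoming; }
    float sum3() { return r0 + r1 + r2; }

    dst[0] = sum3();
    shiftUp(src[3]);   // roll the stencil on to the next element
    dst[1] = sum3();
}
```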