ldc/dcompute and shared (programmer managed cache) access

Bruce Carneal bcarneal at gmail.com
Fri Mar 5 04:49:56 UTC 2021


On Friday, 5 March 2021 at 00:39:29 UTC, kinke wrote:
> On Thursday, 4 March 2021 at 04:52:52 UTC, Bruce Carneal wrote:
>> As a slightly cleaner alternative I'll next look at using 
>> ldc/dcompute as a .o and .ptx generator while punting the 
>> fatbin/linking stuff to clang or nvcc in a build script.  The 
>> current simplicity of single-ptx-file is very nice but 
>> foregoing shared memory performance boosts is not nice so ...  
>> I'm pretty sure that we'll need to move beyond the single ptx 
>> file model if we want to embrace shared cleanly.
>
> Not sure if it's of any help, but using something like 
> `-mtriple=nvptx64 -mcpu=sm_50` might suffice: 
> https://github.com/ldc-developers/ldc/pull/3411#issuecomment-619385886

Per the other thread that you've seen, and helped with, the 
intrinsics are mostly good to go.  Thanks.

Unfortunately CUDA __shared__ is not, AFAICT, addressed by the 
pull request that you cited.  We need some way to get a 
__shared__  declaration emitted to the PTX file.

Code generation through standalone (uninitialized) __shared__ 
pointers, which we *can* declare, already emits the correct PTX 
instructions (ld.shared, st.shared), but we have no way, at least 
none that I've found, to declare __shared__ address space storage 
for those pointers to point to.  (clang CUDA can do this, nvcc can 
obviously do this, and we can currently do this only by manually 
modifying the .ptx file.)
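
For reference, here is a minimal CUDA C++ sketch of the capability in question, i.e. what clang CUDA and nvcc accept today.  It is not dcompute code: the names `reverseBlock`, `reverseOnGpu` and `BLOCK` are made up for illustration.  The `__shared__` array is the piece that ends up as a `.shared` variable declaration in the generated PTX.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

constexpr int BLOCK = 256;  // illustrative block size, not from the thread

// Reverses each BLOCK-sized chunk of `data` in place via a shared scratchpad.
__global__ void reverseBlock(int *data) {
    __shared__ int tile[BLOCK];                 // storage in the shared address space
    int base = blockIdx.x * BLOCK;
    tile[threadIdx.x] = data[base + threadIdx.x];              // store to shared
    __syncthreads();                            // wait until the whole block has written
    data[base + threadIdx.x] = tile[BLOCK - 1 - threadIdx.x];  // load from shared
}

// Hypothetical host helper: reverses `blocks` chunks of BLOCK ints in `host`.
// Returns true if every CUDA call succeeded.
bool reverseOnGpu(int *host, int blocks) {
    int *dev = nullptr;
    size_t bytes = size_t(blocks) * BLOCK * sizeof(int);
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;
    bool ok = cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        reverseBlock<<<blocks, BLOCK>>>(dev);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dev);
    return ok;
}
```

Calling `reverseOnGpu(buf, 2)` on a buffer of `2 * BLOCK` ints reverses each half independently.  The pointer-based loads and stores are the part ldc already handles; the `tile` declaration is the part that is missing.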

The newly available shuffle intrinsics provide for efficient 
cooperative processing at the warp (32 lane) level, which helps a 
lot.  The __shared__ address space capability would allow for 
more efficient cooperation for up to 1024 lanes (an Nvidia 
"block" upper limit currently IIUC).
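
To make the warp-versus-block distinction concrete, here is a hedged CUDA C++ sketch of a block-wide sum: shuffles handle the 32 lanes within each warp, and a small `__shared__` array carries the per-warp totals across warps.  The names `warpSum`, `blockSum` and `blockSumOnGpu` are illustrative, and the code assumes the block size is a multiple of 32.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

constexpr unsigned FULL_MASK = 0xffffffffu;  // all 32 lanes participate

// Sum across the 32 lanes of a warp using shuffles only (no shared memory).
__device__ int warpSum(int v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(FULL_MASK, v, offset);
    return v;  // lane 0 holds the warp total
}

// Sum across a whole block (up to 1024 threads = 32 warps).
// Assumes blockDim.x is a multiple of 32.
__global__ void blockSum(const int *in, int *out) {
    __shared__ int partial[32];              // one slot per warp
    int lane = threadIdx.x % 32;
    int warp = threadIdx.x / 32;
    int v = warpSum(in[blockIdx.x * blockDim.x + threadIdx.x]);
    if (lane == 0) partial[warp] = v;        // publish each warp's total
    __syncthreads();
    if (warp == 0) {                         // first warp combines the totals
        int nwarps = blockDim.x / 32;
        v = (lane < nwarps) ? partial[lane] : 0;
        v = warpSum(v);
        if (lane == 0) out[blockIdx.x] = v;
    }
}

// Hypothetical host helper: sums n ints (n a multiple of 32, n <= 1024)
// with a single block.  Returns -1 if any CUDA call fails.
int blockSumOnGpu(const int *host, int n) {
    int *in = nullptr, *out = nullptr, result = -1;
    if (cudaMalloc(&in, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&out, sizeof(int)) != cudaSuccess) { cudaFree(in); return -1; }
    if (cudaMemcpy(in, host, n * sizeof(int), cudaMemcpyHostToDevice) == cudaSuccess) {
        blockSum<<<1, n>>>(in, out);
        int tmp = 0;
        if (cudaGetLastError() == cudaSuccess &&
            cudaMemcpy(&tmp, out, sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess)
            result = tmp;
    }
    cudaFree(in);
    cudaFree(out);
    return result;
}
```

Without the `partial` array the per-warp totals would have to be combined through global memory, for example with atomics or a second kernel launch.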

As I understand it, the L1 caches on Nvidia cards are now 
writeback, so with the very relaxed memory model in play on GPUs 
an explicit programmer-managed scratchpad (__shared__) is not as 
important as it used to be for many apps, but in some situations 
it still matters.  It would be great if we could find a way to 
provide the capability somewhere down the road.
