ldc/dcompute and shared (programmer managed cache) access

Thu Mar 4 04:52:52 UTC 2021

ldc doesn't currently handle CUDA/Nvidia/PTX shared memory 
declarations, but shared memory can be very useful when tuning 
block (workgroup) cooperative algorithms.

It turns out that you can manually (or programmatically) inject a 
.shared declaration into the .ptx output file and, thereafter, 
obtain the shared memory pointer with a three-instruction 
sequence.
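
A minimal sketch of the "programmatic" half of that workaround, 
assuming the emitted .ptx file starts with the usual .version / 
.target / .address_size header and that a fixed-size, module-scope 
byte array is what you want. The function name inject_shared, the 
symbol name scratch and the 1024-byte size are hypothetical, 
chosen only for illustration. The pointer-fetch sequence itself is 
not shown here.

```python
def inject_shared(ptx: str, name: str, nbytes: int, align: int = 4) -> str:
    """Return the PTX text with a module-scope .shared byte array added.

    The declaration is placed directly after the .address_size directive,
    i.e. after the module header and before any kernel definitions.
    """
    decl = f".shared .align {align} .b8 {name}[{nbytes}];"

    # Running the script twice over the same file should be harmless.
    if decl in ptx:
        return ptx

    lines = ptx.splitlines()
    for i, line in enumerate(lines):
        if line.strip().startswith(".address_size"):
            lines.insert(i + 1, decl)
            return "\n".join(lines) + "\n"

    # Assumption: every PTX module we handle carries this directive.
    raise ValueError("no .address_size directive found; is this a PTX file?")
```

In a build script this would be applied to the text of the .ptx 
file right after ldc writes it, for example 
inject_shared(text, "scratch", 1024), with the result written 
back before the file is loaded.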

As a slightly cleaner alternative, I'll next look at using 
ldc/dcompute as a .o and .ptx generator while punting the 
fatbin/linking stuff to clang or nvcc in a build script.  The 
current simplicity of the single-ptx-file model is very nice, but 
forgoing shared memory performance boosts is not.  I'm pretty 
sure that we'll need to move beyond the single ptx file model if 
we want to embrace shared cleanly.

I left a comment on the open dcompute git issue regarding 
shared but have seen no response there.  If you have guidance to 
give on this topic, please speak up.

Finally, after locking down the shared workaround, I'll look at 
incorporating all the CUDA intrinsics that clang does.  It turns 
out that the very clean irEx hack that Johan provided, which works 
for clz, apparently only works on a smallish subset of the 
intrinsics (that, or my ignorance is showing again).  More on that 
later.