ldc/dcompute nvptx intrinsics

Fri Mar 5 05:03:13 UTC 2021

On Tuesday, 23 February 2021 at 23:36:53 UTC, Bruce Carneal wrote:
> On Tuesday, 23 February 2021 at 18:04:52 UTC, Johan wrote:
>> On Sunday, 21 February 2021 at 01:18:10 UTC, Bruce Carneal 
>> wrote:
[...]
>>
>> Hi Bruce,
>>   I played around a bit and have a full working example for 
>> you:
>>
>> ```
>> @compute(CompileFor.deviceOnly) module dcompute;
>>
>> import ldc.dcompute;
>>
>> [... working example ...]
>> ```
>>
>> If this indeed will fit your usecase, you have a good argument 
>> for including `__irEx` into ldc.dcompute. Please file 
>> bugs/features on github!
>>
>> cheers,
>>   Johan
>
> Success!
>
> After verifying that your latest example worked I plugged in 
> llvm.nvvm.clz.i from the .td file.  That generated the hoped 
> for single instruction function body:
>   clz.b32  %r2, %r1
> This clz intrinsic alone saves me a couple dozen instructions 
> in a hot section of code where I "call" clz twice.
>
> I will expand the set of intrinsics enabled via the __irEx 
> method over the next few days and then try to contact Nicholas 
> W. and/or John C. via email or beerconf to get their take on 
> the capabilities (they may suggest a more easily supported way 
> to go about things, or have cuda naming suggestions, or want to 
> rationalize these with OCL, or ...).  Assuming that goes well 
> I'll file with ldc and dcompute.
>
> Thank you Johan.

It turns out that almost all of the intrinsics are available after 
a tiny bit of .di file massaging (see the "ldc nvvm GPU intrinsics 
good news" thread).  The `__irEx` capability Johan pointed out 
seems to cover the rest.

The `__irEx`-accessible set is defined at the top of the intrinsics 
.td file and includes the popc intrinsic, the clz intrinsic, and 
about a dozen others, most of them very useful.  The .di file 
intrinsics number over 500, some of which appear useful.
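
For anyone following along, here is a minimal sketch of what a clz 
wrapper built on the `__irEx` method might look like.  It is not 
Johan's elided example: the module name, the wrapper name `clz`, 
and the local `pragma(LDC_inline_ir)` declaration of `__irEx` are 
assumptions made for illustration.  Only the intrinsic name 
llvm.nvvm.clz.i and the expected `clz.b32` output come from this 
thread.

```
@compute(CompileFor.deviceOnly) module clz_sketch; // hypothetical module name

import ldc.dcompute;

// Assumption: declare the inline-IR template locally, mirroring the
// signature LDC documents for __irEx (prefix, code, suffix, return
// type, parameter types).
pragma(LDC_inline_ir)
    R __irEx(string prefix, string code, string suffix, R, P...)(P);

// Hypothetical wrapper: count leading zeros of a 32-bit value.
pragma(inline, true)
int clz(int x)
{
    return __irEx!(
        // prefix: module-level IR, here the intrinsic declaration
        `declare i32 @llvm.nvvm.clz.i(i32)`,
        // code: the function body; %0 is the first D argument
        `%r = call i32 @llvm.nvvm.clz.i(i32 %0)
         ret i32 %r`,
        // suffix: unused in this sketch
        ``,
        int, int)(x);
}
```

If this works as intended, the generated PTX for `clz` should 
contain the single `clz.b32` instruction quoted above.  A popc 
wrapper would presumably follow the same pattern with the 
corresponding intrinsic name from the .td file.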