Scientific computing and parallel computing C++23/C++26

Bruce Carneal bcarneal at gmail.com
Sat Jan 15 10:52:29 UTC 2022


On Saturday, 15 January 2022 at 09:03:11 UTC, Nicholas Wilson 
wrote:
> On Saturday, 15 January 2022 at 08:01:15 UTC, Paulo Pinto wrote:
>> On Saturday, 15 January 2022 at 00:29:20 UTC, Nicholas Wilson 
>> wrote:
>>> ....
>>>
>>> Definitely. Homogeneous memory is interesting for the ability 
>>> to make GPUs do the things GPUs are good at and leave the 
>>> rest to the CPU without worrying about memory transfers across 
>>> the PCI-e bus, something CUDA can't take advantage of on 
>>> account of Nvidia GPUs being discrete-only. I've no idea how 
>>> caching works in a system like that, though.
>>> ...
>>
>> How is this different from unified memory?
>>
>> https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-unified-memory-programming-hd
>
> There's still a PCI-e bus in between. Fundamentally, the memory 
> must exist in either the CPU's RAM or the GPU's (V)RAM; from 
> what I understand, unified memory allows the GPU to access the 
> host RAM with the same pointer. This reduces the total memory 
> consumed by the program, but to get to the GPU the data must 
> still cross the PCI-e bus.

Yes.  You also gain some simplification from unified memory if 
your data structures are pointer-heavy.
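
For anyone unfamiliar with it, the CUDA side of this looks 
roughly like the sketch below: a single cudaMallocManaged 
allocation yields one pointer that both host and device code can 
dereference, with the runtime migrating pages on demand.  The 
bump kernel here is just an illustrative placeholder, not 
anything from a real codebase.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Illustrative placeholder kernel: increments each element.
    __global__ void bump(float *p, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i] += 1.0f;
    }

    int main()
    {
        const int n = 1 << 20;
        float *p;
        // One allocation, one pointer, usable on host and device.
        cudaMallocManaged(&p, n * sizeof(float));
        for (int i = 0; i < n; ++i) p[i] = 0.0f;  // host write

        bump<<<(n + 255) / 256, 256>>>(p, n);  // device, same pointer
        cudaDeviceSynchronize();   // pages migrate back on demand

        printf("%f\n", p[0]);      // host read, no explicit copy
        cudaFree(p);
        return 0;
    }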

I've tried to gain an advantage from GPU-side pulls across the 
bus in the past but could never win out over explicit async 
copies using the dedicated copy circuitry.  Others, particularly 
those whose workloads have high compute-to-load/store ratios, may 
have had better luck.
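
For contrast, the explicit style I mean is roughly the following 
sketch: pinned host memory plus cudaMemcpyAsync on a stream, 
which is what lets the dedicated copy engine run the DMA while 
other work proceeds.  The buffer size is arbitrary.

    #include <cuda_runtime.h>

    int main()
    {
        const size_t bytes = size_t(1) << 26;  // 64 MiB, arbitrary
        float *h, *d;
        cudaMallocHost(&h, bytes);  // pinned: needed for true async DMA
        cudaMalloc(&d, bytes);

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // The copy engine moves the data; the CPU and kernels on
        // other streams are free to run while the DMA is in flight.
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream);
        // ... overlapping work would go here ...
        cudaStreamSynchronize(stream);

        cudaStreamDestroy(stream);
        cudaFree(d);
        cudaFreeHost(h);
        return 0;
    }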

For reference, I've only been able to get a little over 80% of 
the advertised PCI-e peak bandwidth out of the dedicated Nvidia 
copy HW.
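
For anyone who wants to reproduce that kind of number, a minimal 
measurement sketch is below: event timing around a pinned-memory 
async copy.  The 256 MiB size and the single-shot timing are 
arbitrary choices; a real benchmark would warm up and average 
many iterations.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        const size_t bytes = size_t(1) << 28;  // 256 MiB, arbitrary
        float *h, *d;
        cudaMallocHost(&h, bytes);  // pinned host buffer
        cudaMalloc(&d, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("H2D: %.1f GB/s\n", (bytes / 1e9) / (ms / 1e3));

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d);
        cudaFreeHost(h);
        return 0;
    }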


