core.traits?

Thu Jan 10 21:01:09 UTC 2019

On Thursday, 10 January 2019 at 10:13:57 UTC, Ethan wrote:
> On Thursday, 10 January 2019 at 00:10:18 UTC, Mike Franklin 
> wrote:
>> I remember analyzing other implementations of `memcpy` and 
>> they were all using AVX512.  I had faith in the authors of 
>> those implementations (e.g. Agner Fog) that they knew more 
>> than me, so that was what I should be using. Perhaps I should 
>> revisit it and just do the best that DMD can do.
>
> AVX512 is a superset of AVX2, is a superset of AVX, is a 
> superset of SSE. I expect the implementations you were looking 
> at are actually implemented in SSE, where SSE2 is a baseline 
> expectation for x64 processors.
>
> I've done some AVX2 code recently with 256-bit values. The 
> performance is significantly slower on AMD processors. I assume 
> their pipeline internally is still 128 bit as a result,
> and while my 256-bit code can run faster on Intel it needs to 
> run on AMD so I've dropped to 128-bit instructions at most - 
> effectively keeping my code SSE4.1 compatible.
>
> I've done a memset_pattern4[1] implementation in SSE 
> previously. The important instruction group is _mm_stream. 
> Which, you will note, was an instruction group first introduced 
> in SSE1 and hasn't had additional writing stream functions 
> added since SSE 4.1[2].
>
> [1] 
> https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man3/memset_pattern4.3.html
> [2] 
> https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=5119,5452,5443,5910,5288,5119,5249,5231&text=_mm_stream

That's disappointing to learn. Ryzen has four 128-bit AVX units: 
two of them can only do addition and the other two can only do 
multiplication. I'm not sure how memory is shared between the 
units, but if it isn't, it would need to copy data between them 
to do an addition followed by a multiplication.
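
For reference, here is a rough sketch in C of what a pattern fill built on the `_mm_stream` group mentioned above might look like. It is not Ethan's implementation. The name `fill_pattern4` is hypothetical, the behaviour is only assumed to mirror `memset_pattern4` (the pattern starts at the first destination byte), and the scalar fallback is there so it still compiles where SSE2 is unavailable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_set1_epi32, _mm_stream_si128, _mm_sfence */
#endif

/* Hypothetical helper: fill `len` bytes at `dst` with a repeating
 * 4-byte pattern, using non-temporal 128-bit stores where possible. */
static void fill_pattern4(void *dst, const void *pattern4, size_t len)
{
    unsigned char *p = (unsigned char *)dst;
    const unsigned char *pat = (const unsigned char *)pattern4;
    size_t i = 0;

    assert(dst != NULL && pattern4 != NULL);

#if defined(__SSE2__)
    /* Scalar head: _mm_stream_si128 requires a 16-byte aligned address. */
    while (i < len && ((uintptr_t)(p + i) & 15u) != 0) {
        p[i] = pat[i & 3];
        i++;
    }

    if (len - i >= 16) {
        /* Rotate the pattern so the vector begins at the current phase. */
        unsigned char rot[4];
        uint32_t word;
        size_t j;
        __m128i v;

        for (j = 0; j < 4; j++)
            rot[j] = pat[(i + j) & 3];
        memcpy(&word, rot, sizeof word);
        v = _mm_set1_epi32((int)word);

        /* Streaming stores write around the cache; the phase is
         * unchanged because each step advances by 16 bytes. */
        for (; i + 16 <= len; i += 16)
            _mm_stream_si128((__m128i *)(p + i), v);

        /* Make the non-temporal stores globally visible in order. */
        _mm_sfence();
    }
#endif

    /* Scalar tail (or the whole fill when SSE2 is not available). */
    for (; i < len; i++)
        p[i] = pat[i & 3];
}
```

Everything here stays at 128 bits, so it only needs SSE2. Whether streaming stores actually win depends on the buffer size relative to the cache, so treat this as a starting point to benchmark rather than a recommendation.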