core.traits?

Thu Jan 10 10:13:57 UTC 2019

On Thursday, 10 January 2019 at 00:10:18 UTC, Mike Franklin wrote:
> I remember analyzing other implementations of `memcpy` and they 
> were all using AVX512.  I had faith in the authors of those 
> implementations (e.g. Agner Fog) that they knew more than me, 
> so that was what I should be using. Perhaps I should revisit it 
> and just do the best that DMD can do.

AVX512 is a superset of AVX2, which is a superset of AVX, which is
in turn a superset of SSE. I expect the implementations you were
looking at are actually implemented in SSE, with SSE2 being a
baseline expectation for x64 processors.
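
As a rough illustration of that superset chain, here is a minimal sketch of picking a vector width at run time. It assumes GCC or Clang on x86, since `__builtin_cpu_init` and `__builtin_cpu_supports` are compiler builtins rather than standard C. The helper name `widest_vector_bits` is made up for this example.

```c
/* Hypothetical helper: report the widest integer vector width, in bits,
 * that the running CPU advertises. Assumes GCC or Clang on x86; the
 * __builtin_cpu_* calls are compiler builtins, not standard C. */
static int widest_vector_bits(void)
{
    __builtin_cpu_init();                 /* harmless if already initialised */

    /* Test from the newest extension down; each one implies those below it. */
    if (__builtin_cpu_supports("avx512f")) return 512;
    if (__builtin_cpu_supports("avx2"))    return 256;
    if (__builtin_cpu_supports("sse2"))    return 128;
    return 0;                             /* no SSE2: not a normal x64 CPU */
}
```

A dispatcher could call this once at start-up and choose a code path from the result. On any x64 machine the result should be at least 128.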

I've done some AVX2 code recently with 256-bit values. The
performance is significantly slower on AMD processors, so I assume
their pipeline is still 128 bits wide internally. While my 256-bit
code can run faster on Intel, it also needs to run on AMD, so I've
dropped to 128-bit instructions at most, effectively keeping my
code SSE4.1 compatible.

I've done a memset_pattern4[1] implementation in SSE previously.
The important instruction group is _mm_stream, which, you will
note, was first introduced in SSE1 and hasn't had additional
streaming write functions added since SSE 4.1[2].
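
For anyone who hasn't used the streaming stores, here is a minimal sketch of a memset_pattern4-style fill built on `_mm_stream_si128` (SSE2). This is not the implementation mentioned above, only an illustration. The name `fill_pattern4_stream` is made up, and the head/tail handling is one possible choice.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fill `len` bytes at `dst` with the repeating
 * 4-byte pattern at `pattern4`, in the spirit of memset_pattern4. */
static void fill_pattern4_stream(void *dst, const void *pattern4, size_t len)
{
    unsigned char *p = (unsigned char *)dst;
    const unsigned char *pat = (const unsigned char *)pattern4;
    size_t i = 0;

    /* Head: plain byte stores until the address is 16-byte aligned,
     * which _mm_stream_si128 requires. */
    while (i < len && ((uintptr_t)(p + i) & 15u) != 0) {
        p[i] = pat[i & 3];
        ++i;
    }

    if (len - i >= 16) {
        /* Build one 16-byte block whose phase matches the current offset.
         * Advancing by 16 keeps (i & 3) constant, so one block suffices. */
        unsigned char block[16];
        for (size_t k = 0; k < 16; ++k)
            block[k] = pat[(i + k) & 3];
        __m128i v = _mm_loadu_si128((const __m128i *)block);

        /* Body: non-temporal stores that bypass the cache. */
        for (; i + 16 <= len; i += 16)
            _mm_stream_si128((__m128i *)(p + i), v);

        _mm_sfence();  /* order the streamed stores before later accesses */
    }

    /* Tail: remaining bytes, including a partial pattern at the end. */
    for (; i < len; ++i)
        p[i] = pat[i & 3];
}
```

Streaming stores tend to pay off only for large fills that would otherwise evict useful cache lines. For small buffers, ordinary stores are likely the better choice.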

[1] 
https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man3/memset_pattern4.3.html
[2] 
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=5119,5452,5443,5910,5288,5119,5249,5231&text=_mm_stream