core.traits?
Ethan
gooberman at gmail.com
Thu Jan 10 10:13:57 UTC 2019
On Thursday, 10 January 2019 at 00:10:18 UTC, Mike Franklin wrote:
> I remember analyzing other implementations of `memcpy` and they
> were all using AVX512. I had faith in the authors of those
> implementations (e.g. Agner Fog) that they knew more than me,
> so that was what I should be using. Perhaps I should revisit it
> and just do the best that DMD can do.
AVX512 is a superset of AVX2, which is a superset of AVX, which is
a superset of SSE. I expect the implementations you were looking
at are actually written in SSE, since SSE2 is a baseline
expectation for x64 processors.
I've done some AVX2 code recently with 256-bit values. It runs
significantly slower on AMD processors, so I assume their
pipeline is still 128 bits wide internally. While my 256-bit code
can run faster on Intel, it also needs to run on AMD, so I've
dropped to 128-bit instructions at most, effectively keeping my
code SSE4.1 compatible.
I've done a memset_pattern4[1] implementation in SSE previously.
The important instruction group is _mm_stream, which, you will
note, was first introduced in SSE1 and hasn't had additional
writing stream functions added since SSE 4.1[2].
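
For illustration, here is a rough sketch in C of what a pattern fill built on streaming stores might look like. This is not my original implementation. The name fill_pattern4 is made up for the example, and it assumes a compiler that defines __SSE2__ and ships <emmintrin.h>, falling back to a plain byte loop otherwise. It uses _mm_stream_si128, the SSE2 integer streaming store, and follows the memset_pattern4 convention of allowing a partial pattern at the end.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2: _mm_set1_epi32, _mm_stream_si128, _mm_sfence */
#endif

/* Hypothetical helper: fill len bytes at dst with a repeating 4-byte
   pattern, in the spirit of memset_pattern4. */
void fill_pattern4(void *dst, const void *pattern4, size_t len)
{
    unsigned char *p = (unsigned char *)dst;
    unsigned char pat[4];
    size_t i = 0;

    memcpy(pat, pattern4, 4);

#if defined(__SSE2__)
    /* Streaming stores need a 16-byte aligned address, so write the
       unaligned head one byte at a time. */
    while (i < len && (((uintptr_t)(p + i)) & 15) != 0) {
        p[i] = pat[i & 3];
        i++;
    }

    if (len - i >= 16) {
        /* Rotate the pattern so the vector stays in phase with the
           bytes already written in the head. */
        unsigned char rot[4];
        int word;
        for (int k = 0; k < 4; k++)
            rot[k] = pat[(i + k) & 3];
        memcpy(&word, rot, 4);

        __m128i v = _mm_set1_epi32(word);

        /* Non-temporal stores: write 16 bytes at a time without
           pulling the destination into the cache. */
        for (; i + 16 <= len; i += 16)
            _mm_stream_si128((__m128i *)(p + i), v);

        /* Make the streamed data visible before any later stores. */
        _mm_sfence();
    }
#endif

    /* Tail, or the whole buffer when SSE2 is unavailable. */
    for (; i < len; i++)
        p[i] = pat[i & 3];
}
```

Bypassing the cache tends to pay off only for large fills. For small buffers an ordinary store loop is likely the better choice, so a real implementation would probably pick a path based on len.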
[1]
https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man3/memset_pattern4.3.html
[2]
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=5119,5452,5443,5910,5288,5119,5249,5231&text=_mm_stream