core.traits?
luckoverthere
luckoverthere at gmail.cm
Thu Jan 10 21:01:09 UTC 2019
On Thursday, 10 January 2019 at 10:13:57 UTC, Ethan wrote:
> On Thursday, 10 January 2019 at 00:10:18 UTC, Mike Franklin
> wrote:
>> I remember analyzing other implementations of `memcpy` and
>> they were all using AVX512. I had faith in the authors of
>> those implementations (e.g. Agner Fog) that they knew more
>> than me, so that was what I should be using. Perhaps I should
>> revisit it and just do the best that DMD can do.
>
> AVX512 is a superset of AVX2, is a superset of AVX, is a
> superset of SSE. I expect the implementations you were looking
> at are actually implemented in SSE, where SSE2 is a baseline
> expectation for x64 processors.
>
> I've done some AVX2 code recently with 256-bit values. The
> performance is significantly slower on AMD processors. I assume
> their pipeline internally is still 128 bit as a result,
> and while my 256-bit code can run faster on Intel it needs to
> run on AMD so I've dropped to 128-bit instructions at most -
> effectively keeping my code SSE4.1 compatible.
>
> I've done a memset_pattern4[1] implementation in SSE
> previously. The important instruction group is _mm_stream.
> Which, you will note, was an instruction group first introduced
> in SSE1 and hasn't had additional writing stream functions
> added since SSE 4.1[2].
>
> [1]
> https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man3/memset_pattern4.3.html
> [2]
> https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=5119,5452,5443,5910,5288,5119,5249,5231&text=_mm_stream
That's disappointing to learn. Ryzen has four 128-bit AVX units:
two of them can only do addition and the other two can only do
multiplication. I'm not sure how memory is shared between the
units, but if it isn't, data would need to be copied between them
to do an addition followed by a multiplication.
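
For reference, the streaming-store approach Ethan mentions for a
memset_pattern4-style fill could be sketched roughly as follows in
C with SSE2 intrinsics. This is only an illustration, not his
implementation: the function name `fill_pattern4_stream` is
hypothetical, and the alignment handling is an assumption about how
one might do it. `_mm_stream_si128` requires a 16-byte-aligned
destination, so the head and tail are written byte by byte.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_set1_epi32, _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the thread): fill `len` bytes at `dst`
 * with a repeating 4-byte pattern, using non-temporal 128-bit stores
 * for the aligned middle part. */
static void fill_pattern4_stream(void *dst, const void *pattern4, size_t len)
{
    assert(dst != NULL || len == 0);

    unsigned char *p = (unsigned char *)dst;
    const unsigned char *pat = (const unsigned char *)pattern4;
    size_t i = 0;

    /* Head: plain byte stores until p + i is 16-byte aligned. */
    while (i < len && ((uintptr_t)(p + i) & 15u) != 0) {
        p[i] = pat[i & 3];
        i++;
    }

    /* The head may have stopped mid-pattern, so rotate the pattern
     * to the phase at offset i before broadcasting it. */
    unsigned char rot[4];
    for (int k = 0; k < 4; k++)
        rot[k] = pat[(i + (size_t)k) & 3];

    int32_t word;
    memcpy(&word, rot, sizeof word);
    const __m128i v = _mm_set1_epi32(word);

    /* Middle: streaming stores that bypass the cache. */
    for (; i + 16 <= len; i += 16)
        _mm_stream_si128((__m128i *)(p + i), v);

    /* Order the non-temporal stores before any later stores. */
    _mm_sfence();

    /* Tail: 16 is a multiple of 4, so the phase is still i & 3. */
    for (; i < len; i++)
        p[i] = pat[i & 3];
}
```

Since this only uses 128-bit SSE2 operations, it stays within the
x64 baseline discussed above. Whether the non-temporal stores are
actually a win depends on the buffer size and the CPU, so it would
need to be measured.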