core.traits?
Mike Franklin
slavo5150 at yahoo.com
Thu Jan 10 00:10:18 UTC 2019
On Wednesday, 9 January 2019 at 12:31:13 UTC, Patrick Schluter
wrote:
> AVX512 concerns only a very small part of processors on the
> market (Skylake, Cannon Lake and Cascade Lake). AMD will never
> implement it and the number of people upgrading to one of the
> lake cpus from some recent chip is also not that great.
Yes, I agree, and even the newer chips have "Enhanced REP MOVSB
and STOSB operation (ERMSB)", which can compensate. See section
3.7.6 of
https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
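To illustrate what leaning on ERMSB amounts to, here is a minimal
sketch in C (not D, and not taken from any existing implementation).
The helper name `copy_rep_movsb` is hypothetical. The inline assembly
assumes GCC or Clang on x86-64; other targets fall back to a plain
byte loop. The sketch does not check CPUID for ERMSB support, so
whether it is actually fast depends on the CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: forward copy of n bytes using REP MOVSB.
 * Assumes non-overlapping buffers, like memcpy. */
void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    void *ret = dst;

    /* Null pointers are only tolerated for an empty copy. */
    assert(n == 0 || (dst != NULL && src != NULL));

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    /* RDI = destination, RSI = source, RCX = byte count.
     * The direction flag is clear on function entry per the ABI,
     * so the copy runs forward. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    /* Portable fallback so the sketch still builds elsewhere. */
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
#endif
    return ret;
}
```

A call such as `copy_rep_movsb(dst, src, len)` behaves like `memcpy`
for non-overlapping buffers and returns `dst`.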
> I don't see why not having it implemented yet is blocking
> anything. People who really need AVX512 performance will have
> implemented memcpy themselves already and for the others, they
> will have to wait a little bit. It's not as if it couldn't be
> added later. I really don't understand the problem.
I remember analyzing other implementations of `memcpy` and they
were all using AVX512. I had faith that the authors of those
implementations (e.g. Agner Fog) knew more than I did, so I
assumed that was what I should be using. Perhaps I should revisit
it and just do the best that DMD can do.
But also keep in mind that there's a strategy to getting things
accepted in DMD and elsewhere. You are often battling
perception. The single most challenging aspect of implementing
`memcpy` in D is overcoming bias and justifying it to the
obstructionists that see it as a complete waste of time. If I
can't implement it in AVX512 simply for the purpose of
measurement and comparison, it will be more difficult to justify.
> This said, another issue with memcpy that very often gets lost
> is that, because of the fancy benchmarking, its system
> performance cost is often wrongly assessed, and a lot of heroic
> efforts are put in optimizing big block transfers, while in
> reality it's mostly called on small (postblit) to medium
> blocks. Linus Torvalds had once a rant on that subject on
> realworldtech.
> https://www.realworldtech.com/forum/?threadid=168200&curpostid=168589
I understand. I also encountered a lot of difficulty getting
consistent measurements in my exploration. Doing proper
measurement and analysis for this kind of thing is a skill in and
of itself.
You're right about the small copies being the norm. As part of
my exploration, I wrote a logging `memcpy` wrapper to see what
kind of copies DMD was doing when it compiled itself, and it was
as you describe.
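For illustration, a wrapper of that kind could look roughly like
the following C sketch. This is not the original wrapper; the names
`logged_memcpy`, `bucket_for`, `size_histogram`, and
`print_histogram` are hypothetical, and the power-of-two bucketing
is just one assumed way to summarize copy sizes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define NUM_BUCKETS ((size_t)16)

/* size_histogram[0] counts zero-length copies; bucket b >= 1 counts
 * copies of 2^(b-1) .. 2^b - 1 bytes. The last bucket also collects
 * everything larger. */
unsigned long size_histogram[NUM_BUCKETS];

/* Map a copy size to its histogram bucket. */
size_t bucket_for(size_t n)
{
    size_t b = 0;
    while (n != 0 && b < NUM_BUCKETS - 1) {
        n >>= 1;
        ++b;
    }
    return b;
}

/* Drop-in replacement for memcpy that records the size first. */
void *logged_memcpy(void *dst, const void *src, size_t n)
{
    size_t b = bucket_for(n);
    assert(b < NUM_BUCKETS);
    ++size_histogram[b];
    return memcpy(dst, src, n);
}

/* Print only the buckets that were actually hit. */
void print_histogram(FILE *out)
{
    for (size_t b = 0; b < NUM_BUCKETS; ++b) {
        if (size_histogram[b] != 0)
            fprintf(out, "bucket %zu: %lu copies\n",
                    b, size_histogram[b]);
    }
}
```

Routing calls through `logged_memcpy` and calling
`print_histogram(stderr)` at exit gives a rough picture of whether
small or large copies dominate.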
Perhaps I'll give it another go at a later time, but we need to
get dynamic stack allocation working first because many of the
runtime hook implementations that will utilize `memcpy` do some
error checking and assertions, and we need to be able to generate
dynamic error messages for those assertions when the caller is
`pure`. We need a solution to this
(https://issues.dlang.org/show_bug.cgi?id=18788) first.
Mike