core.traits?

Thu Jan 10 00:10:18 UTC 2019

On Wednesday, 9 January 2019 at 12:31:13 UTC, Patrick Schluter 
wrote:

> AVX512 concerns only a very small part of processors on the 
> market (Skylake, Cannon Lake and Cascade Lake). AMD will never 
> implement it and the number of people upgrading to one of the 
> lake cpus from some recent chip is also not that great.

Yes, I agree, and even the newer chips have "Enhanced REP MOVSB 
and STOSB operation (ERMSB)" which can compensate.  See section 
3.7.6 of 
https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

> I don't see why not having it implemented yet is blocking 
> anything. People who really need AVX512 performance will have 
> implemented memcpy themselves already and for the others, they 
> will have to wait a little bit. It's not as if it couldn't be 
> added later. I really don't understand the problem.

I remember analyzing other implementations of `memcpy` and they 
were all using AVX512.  I had faith that the authors of those 
implementations (e.g. Agner Fog) knew more than I did, so I 
assumed that was what I should be using. Perhaps I should revisit 
it and just do the best that DMD can do.

But also keep in mind that there's a strategy to getting things 
accepted in DMD and elsewhere.  You are often battling 
perception.  The single most challenging aspect of implementing 
`memcpy` in D is overcoming bias and justifying it to the 
obstructionists that see it as a complete waste of time.  If I 
can't implement it in AVX512 simply for the purpose of 
measurement and comparison, it will be more difficult to justify.

> This said, another issue with memcpy that very often gets lost 
> is that, because of the fancy benchmarking, its system 
> performance cost is often wrongly assessed, and a lot of heroic 
> efforts are put in optimizing big block transfers, while in 
> reality it's mostly called on small (postblit) to medium 
> blocks. Linus Torvalds had once a rant on that subject on 
> realworldtech.
> https://www.realworldtech.com/forum/?threadid=168200&curpostid=168589

I understand.  I also encountered a lot of difficulty getting 
consistent measurements in my exploration.  Doing proper 
measurement and analysis for this kind of thing is a skill in and 
of itself.

You're right about the small copies being the norm.  As part of 
my exploration, I wrote a logging `memcpy` wrapper to see what 
kind of copies DMD was doing when it compiled itself, and it was 
as you describe.
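
A minimal sketch of what such a logging wrapper could look like, 
written in C for illustration and not the code used in the 
experiment.  The names `logged_memcpy`, `bucket_for`, 
`copy_counts` and `print_copy_histogram`, and the power-of-two 
bucketing, are all assumptions made for this example.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical histogram: bucket 0 counts zero-length copies and
   bucket b >= 1 counts sizes in [2^(b-1), 2^b). The last bucket
   also absorbs every larger size. */
enum { NUM_BUCKETS = 16 };
static unsigned long copy_counts[NUM_BUCKETS];

/* Map a copy size to its histogram bucket. */
size_t bucket_for(size_t n)
{
    size_t b = 0;
    while (n > 0 && b < NUM_BUCKETS - 1) {
        n >>= 1;
        ++b;
    }
    return b;
}

/* Record the size, then forward to the real memcpy. */
void *logged_memcpy(void *dst, const void *src, size_t n)
{
    ++copy_counts[bucket_for(n)];
    return memcpy(dst, src, n);
}

/* Print one line per non-empty bucket. */
void print_copy_histogram(void)
{
    for (size_t b = 0; b < NUM_BUCKETS; ++b) {
        if (copy_counts[b] != 0)
            printf("bucket %zu: %lu copies\n", b, copy_counts[b]);
    }
}
```

Routing the copies through `logged_memcpy` and calling 
`print_copy_histogram` at exit gives a rough size distribution, 
which is enough to see whether small copies dominate.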

Perhaps I'll give it another go at a later time, but we need to 
get dynamic stack allocation working first because many of the 
runtime hook implementations that will utilize `memcpy` do some 
error checking and assertions, and we need to be able to generate 
dynamic error messages for those assertions when the caller is 
`pure`.  We need a solution to this 
(https://issues.dlang.org/show_bug.cgi?id=18788) first.

Mike