Requesting Help with Optimizing Code

Max Haughton maxhaton at gmail.com
Thu Apr 8 03:58:36 UTC 2021


On Thursday, 8 April 2021 at 03:45:06 UTC, tsbockman wrote:
> On Thursday, 8 April 2021 at 03:27:12 UTC, Max Haughton wrote:
>> Although the obvious point here is vector width (you have
>> AVX-512 from what I can see, however I'm not sure if this is
>> actually a win or not on Skylake W)
>
> From what I've seen, LLVM's code generation and optimization 
> for AVX-512 auto-vectorization is still quite bad and immature 
> compared to AVX2 and earlier, and the wider the SIMD register 
> the more that data structures and algorithms have to be 
> specifically tailored to really benefit from them. Also, using 
> AVX-512 instructions forces the CPU to downclock.
>
> So, I wouldn't expect much benefit from AVX-512 for the time 
> being, unless you're going to hand optimize for it.
>
>> For LDC, you'll want `-mcpu=native`.
>
> Only do this if you don't care about the binary working on any 
> CPU but your own. Otherwise, you need to look at something like 
> the Steam Hardware survey and decide what percentage of the 
> market you want to capture (open the "Other Settings" section): 
> https://store.steampowered.com/hwsurvey
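
To make the trade-off concrete, something like the following 
(the named CPU is only an illustration - `-mcpu=help` lists the 
valid names):

    ldc2 -O3 -release -mcpu=native  app.d   # tuned for the build machine only
    ldc2 -O3 -release -mcpu=haswell app.d   # portable AVX2-era baseline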

You can do multiversioning fairly easily these days.
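
For example, a minimal sketch of doing it by hand with LDC, 
assuming the @target attribute from ldc.attributes and runtime 
feature detection via core.cpuid (the function names are made up):

import ldc.attributes : target;
import core.cpuid : avx2;

// compiled with AVX2 enabled for this one function only
@target("avx2")
int sumAVX2(const(int)[] xs)
{
    int s = 0;
    foreach (x; xs) s += x; // simple reduction LLVM may auto-vectorize
    return s;
}

// baseline version, compiled for whatever -mcpu the binary uses
int sumBaseline(const(int)[] xs)
{
    int s = 0;
    foreach (x; xs) s += x;
    return s;
}

int sum(const(int)[] xs)
{
    // dispatch on the CPU we actually run on; real code would cache
    // this in a function pointer instead of branching on every call
    return avx2 ? sumAVX2(xs) : sumBaseline(xs);
}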

And AVX-512 downclocking can be quite complicated: I have seen 
benchmarks where you can still get a decent speedup even *with* 
the downclocking. At the very least it's worth profiling - the 
reason I brought up Skylake-W specifically is that, IIRC, some of 
the earlier parts effectively emulated the 512-bit vector 
instructions rather than having full-width support in the 
functional units.

D needs finer-grained control of the optimizer *inside* loops - 
e.g. I don't care whether writeln gets inlined, but if something 
doesn't get inlined inside a hot loop you're fucked.
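
To illustrate the kind of knob I mean with what we have today, a 
small sketch using pragma(inline, true) on a helper that a hot 
loop depends on (the names are made up):

pragma(inline, true)
double kernel(double x) pure nothrow @nogc @safe
{
    return x * x + 1.0;
}

double hotLoop(const(double)[] xs) pure nothrow @nogc @safe
{
    double acc = 0;
    foreach (x; xs)
        // if this call isn't inlined, every iteration pays for a
        // call and the vectorizer can't do anything with the loop
        acc += kernel(x);
    return acc;
}

The missing piece is being able to state that requirement for the 
whole loop rather than annotating every callee.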

