A look at Chapel, D, and Julia using kernel matrix calculations
welkam
wwwelkam at gmail.com
Mon May 25 21:27:20 UTC 2020
On Sunday, 24 May 2020 at 16:51:37 UTC, data pulverizer wrote:
> My CPU is coffee lake (Intel i9-8950HK CPU) which is not listed
> under `--mcpu=help`
Just use --mcpu=native. The compiler will detect your CPU and set the
correct flags for you. If you want to specify the architecture
manually, look here:
https://en.wikichip.org/wiki/intel/microarchitectures/coffee_lake#Compiler_support
> I tried `--mcpu=core-avx2 -mattr=+avx2,+sse4.1,+sse4.2` and
> getting the same improved performance as when using
> `--mcpu=native` am I correct in assuming that `core-avx2` is
> right for my CPU?
These flags are for fine-grained control. If you have to ask about
them, that means you should not use them; I would have to google to
answer your question. When you use --mcpu=native all appropriate flags
will be set, so you don't have to worry about them.
For a data scientist, here is a list of flags you should be using, in
order of importance.
--O2 (turning on optimizations is good)
--mcpu=native (allows the compiler to use newer instructions and
enable architecture-specific optimizations. Just don't share the
binaries, because they might crash on older CPUs)
--O3 (less important than mcpu, and sometimes doesn't provide any
speed improvement, so measure, measure, measure)
--flto=thin (link-time optimizations. Good when using libraries.)
PGO (not a single flag, but profile-guided optimization can add a
few % improvement on top of all the other flags)
http://johanengelen.github.io/ldc/2016/07/15/Profile-Guided-Optimization-with-LDC.html
--ffast-math (only useful for floating point (float, double). If
you don't do math with those types then this flag does nothing)
--boundscheck=off (a D-specific flag. The majority of array bounds
checks are removed by the compiler even without this flag, but it's
good to throw it in just to make sure. Don't use it during
development, though, because bounds checking can catch bugs.)
Reading your message, I get the impression that you assumed those
newer instructions would improve performance. When it comes to
performance, never assume anything. Always profile before making
judgments. Maybe your CPU is limited by memory bandwidth, for example
if you only have one stick of RAM and you use all 6 cores.
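As a sketch of what "measure" can look like on Linux (assuming perf is installed; ./kernel is a hypothetical binary built from the code under discussion):

```shell
# Wall-clock time plus hardware counters; a high cache-miss rate, or
# runtime that barely improves when going from 1 core to 6, hints at
# a memory-bandwidth bottleneck rather than an instruction-set one.
perf stat -e instructions,cache-references,cache-misses ./kernel
```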
Anyway, I looked at the disassembly of one function and it's mostly
SSE instructions with a single AVX one. That function is
arrays.Matrix!(float).Matrix
kernelmatrix.calculateKernelMatrix!(kernelmatrix.DotProduct!(float).DotProduct, float).calculateKernelMatrix(kernelmatrix.DotProduct!(float).DotProduct, arrays.Matrix!(float).Matrix)
For SIMD work, D has dedicated vector types. I believe the compiler
guarantees that they are properly aligned, but it's not stated in the
docs.
https://dlang.org/spec/simd.html
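As a minimal sketch of those vector types, here is a 4-wide dot product using core.simd's float4 (this assumes an x86 target where float4 is available; the function name dot4 is mine, not from the code being benchmarked):

```d
import core.simd;

float dot4(float4 a, float4 b)
{
    float4 p = a * b;              // element-wise multiply (one SSE mulps)
    return p[0] + p[1] + p[2] + p[3]; // horizontal sum of the four lanes
}

void main()
{
    float4 a = [1.0f, 2.0f, 3.0f, 4.0f];
    float4 b = [5.0f, 6.0f, 7.0f, 8.0f];
    assert(dot4(a, b) == 70.0f);   // 5 + 12 + 21 + 32
}
```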
I have no experience writing SIMD code, but from what I have heard
over the years, if you want to get maximum performance from your CPU
you have to write your kernels with SIMD intrinsics.
More information about the Digitalmars-d
mailing list