Any usable SIMD implementation?
9il via Digitalmars-d
digitalmars-d at puremagic.com
Mon Apr 4 23:10:37 PDT 2016
On Monday, 4 April 2016 at 22:34:06 UTC, Walter Bright wrote:
> On 4/4/2016 2:05 PM, 9il wrote:
>>>> - Count of FP/Integer registers
>>> ??
>> How many general purpose registers, SIMD floating-point
>> registers, and SIMD integer registers does the CPU have?
>
> These are deducible from X86, X86_64, and SIMD version
> identifiers.
>
It is impossible to deduce from that combination that Xeon Phi
has 32 FP registers.
>> Need to know at compile time whether it is AVX or AVX2
>
> Since the compiler never generates AVX or AVX2 instructions,
> there is no purpose to setting such as a predefined version
> identifier. You might as well use a:
>
> -version=AVX
>
> switch. Note that it is a very bad idea for a compiler to
> detect the CPU it is running on and default generate code
> specific to that CPU.
>
"Since the compiler never generates AVX or AVX2" - this is
definitely not true; see, for example, LLVM's loop vectorization and
SLP vectorization.
This is a normal situation for scientific software, supercomputer
software, and high-performance server applications.
>
>> (this may be completely different source code for these cases).
>
> It's entirely practical to compile code with different source
> code, link them *both* into the executable, and switch between
> them based on runtime detection of the CPU.
>
This approach is complex; it is normal for desktop applications. If
you have a big cluster of similar computers or a supercomputer,
the only thing you want to do is
`-mcpu=native`/`-march=native`. And this single compiler flag
should be enough to build a high-performance linear algebra
application.
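
As a rough illustration of what "one flag is enough" can look like, here is a minimal C++ sketch (not code from this thread). It relies on the macros that GCC and Clang predefine according to `-march=native` or `-mavx2`/`-mavx512f`, such as `__AVX2__` and `__AVX512F__`. The `k*` constants and `describe_target()` are hypothetical names, and the register counts assume x86 only.

```cpp
#include <cstddef>
#include <string>

// Sketch only: compile-time SIMD facts derived from macros that GCC and
// Clang predefine according to -march=... / -mavx2 / -mavx512f.
// All k* constants and describe_target() are hypothetical names.

#if defined(__x86_64__)
constexpr bool kX86_64 = true;   // 64-bit mode exposes more SIMD registers
#else
constexpr bool kX86_64 = false;
#endif

#if defined(__AVX512F__)
constexpr const char* kSimdName = "AVX-512F";
constexpr std::size_t kSimdBytes = 64;                    // 512-bit vectors
constexpr std::size_t kSimdRegisters = kX86_64 ? 32 : 8;  // zmm registers
#elif defined(__AVX2__)
constexpr const char* kSimdName = "AVX2";
constexpr std::size_t kSimdBytes = 32;                    // 256-bit vectors
constexpr std::size_t kSimdRegisters = kX86_64 ? 16 : 8;  // ymm registers
#elif defined(__AVX__)
constexpr const char* kSimdName = "AVX";
constexpr std::size_t kSimdBytes = 32;
constexpr std::size_t kSimdRegisters = kX86_64 ? 16 : 8;
#elif defined(__SSE2__)
constexpr const char* kSimdName = "SSE2";
constexpr std::size_t kSimdBytes = 16;                    // 128-bit vectors
constexpr std::size_t kSimdRegisters = kX86_64 ? 16 : 8;  // xmm registers
#else
// Fallback: nothing known about SIMD, so treat one double as a "vector".
constexpr const char* kSimdName = "scalar";
constexpr std::size_t kSimdBytes = sizeof(double);
constexpr std::size_t kSimdRegisters = 0;
#endif

constexpr std::size_t kDoublesPerVector = kSimdBytes / sizeof(double);

// Human-readable summary of what the compiler was told to target.
std::string describe_target() {
    return std::string(kSimdName) + ": " +
           std::to_string(kDoublesPerVector) + " doubles per vector, " +
           std::to_string(kSimdRegisters) + " SIMD registers";
}
```

Building the same file with and without `-march=native` changes the constants without touching the source, which is the kind of compile-time information asked for here.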
>
>> We have LDC and GDC. It looks like a little standardization
>> based on DMD would be good, even if it would be useless
>> for DMD.
>
> There is no such thing as a standard compiler floating point
> switch, and I'm doubtful defining one would be practical or
> make much of any sense.
>
I just want a unified way to get compile-time (CT) information
about the target and the optimization switches. It is OK if this
information is set by different switches on different compilers.
>
>> With compile time information about CPU it is possible to
>> always have fast
>> generic BLAS for any target as soon as LLVM is released for
>> this target.
>
> The SIMD instruction set is highly resistant to transforming
> generic code into optimal vector instructions. Yes, I know
> about auto-vectorization, and in general it is a doomed and
> unworkable technology.
>
> http://www.amazon.com/dp/0974364924
>
> It's gotta be done by hand to get it to fly.
Auto-vectorization is only an example (maybe a bad one). I would use
SIMD vectors, but I need CT information about the target CPU, because
it is impossible to build optimal BLAS kernels without it! My idea
is an internal kernel compiler :-) Something similar to compile-time
regex, but more complex.
Best regards,
Ilya