Any usable SIMD implementation?

Mon Apr 4 23:10:37 PDT 2016

On Monday, 4 April 2016 at 22:34:06 UTC, Walter Bright wrote:
> On 4/4/2016 2:05 PM, 9il wrote:
>>>> - Count of FP/Integer registers
>>> ??
>> How many general purpose registers, SIMD Floating Point 
>> registers, SIMD Integer
>> registers have a CPU?
>
> These are deducible from X86, X86_64, and SIMD version 
> identifiers.
>

It is impossible to deduce from that combination that Xeon Phi 
has 32 FP registers.

>> Needs to know is it AVX or AVX2 in compile time
>
> Since the compiler never generates AVX or AVX2 instructions, 
> there is no purpose to setting such as a predefined version 
> identifier. You might as well use a:
>
>     -version=AVX
>
> switch. Note that it is a very bad idea for a compiler to 
> detect the CPU it is running on and default generate code 
> specific to that CPU.
>

"Since the compiler never generates AVX or AVX2" - this is 
definitely not true; see, for example, LLVM's loop vectorizer and 
SLP vectorizer.

Generating code for one specific CPU is the normal situation for 
scientific software, supercomputer software, and high-performance 
server applications.

>
>> (this may be completely different source code for this cases).
>
> It's entirely practical to compile code with different source 
> code, link them *both* into the executable, and switch between 
> them based on runtime detection of the CPU.
>

This approach is complex, and it is normal for desktop 
applications. If you have a big cluster of similar computers or a 
supercomputer, the only thing you want to do is pass 
`-mcpu=native` / `-march=native`. This single compiler flag 
should be enough to build a high-performance linear algebra 
application.
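
For reference, the runtime-dispatch approach from the quote above could look roughly like this. It is only a minimal sketch, assuming druntime's `core.cpuid` for detection; `dotGeneric`, `dotAVX`, `dotAVX2` and `selectDot` are hypothetical names, and the "specialized" kernels are placeholders that just forward to the scalar one.

```d
import core.cpuid : avx, avx2;

alias Kernel = float function(const(float)[] a, const(float)[] b);

// Portable scalar kernel.
float dotGeneric(const(float)[] a, const(float)[] b)
{
    assert(a.length == b.length);
    float sum = 0;
    foreach (i; 0 .. a.length)
        sum += a[i] * b[i];
    return sum;
}

// Hypothetical placeholders: a real build would compile hand-written
// AVX / AVX2 variants of the kernel separately and link them all in.
float dotAVX(const(float)[] a, const(float)[] b)  { return dotGeneric(a, b); }
float dotAVX2(const(float)[] a, const(float)[] b) { return dotGeneric(a, b); }

// Choose a kernel once, based on the CPU the program is running on.
Kernel selectDot()
{
    if (avx2) return &dotAVX2;
    if (avx)  return &dotAVX;
    return &dotGeneric;
}
```

Every variant has to be written, built and shipped, and every call goes through a function pointer. That is the overhead a single `-mcpu=native` build avoids on a homogeneous cluster.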

>
>> We have LDC and GDC. And looks like a little bit 
>> standardization based on DMD
>> would be good, even if this would be useless for DMD.
>
> There is no such thing as a standard compiler floating point 
> switch, and I'm doubtful defining one would be practical or 
> make much of any sense.
>

I just want a unified instrument for getting CT information about 
the target and the optimization switches. It is OK if this 
information is set by different switches on different compilers.
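
A minimal sketch of how code could consume such compile-time information, assuming the identifier is passed by hand as in the `-version=AVX` suggestion quoted earlier (`-version=AVX` for DMD, `-d-version=AVX` for LDC, `-fversion=AVX` for GDC). Here `AVX` is a user-defined identifier, and `floatLanes` and `dot` are names chosen only for illustration.

```d
// `AVX` is NOT assumed to be predefined by the compiler here;
// it must be supplied on the command line.
version (AVX)
    enum size_t floatLanes = 8;   // 256-bit vectors hold 8 floats
else
    enum size_t floatLanes = 4;   // assumption: 128-bit SSE baseline

/// Dot product with `floatLanes` independent accumulators, so the
/// blocking factor follows the compile-time target description.
float dot(const(float)[] a, const(float)[] b)
{
    assert(a.length == b.length);
    float[floatLanes] acc = 0;
    size_t i = 0;
    // Main loop: process full blocks of `floatLanes` elements.
    for (; i + floatLanes <= a.length; i += floatLanes)
        foreach (j; 0 .. floatLanes)
            acc[j] += a[i + j] * b[i + j];
    float sum = 0;
    foreach (x; acc)
        sum += x;
    // Tail loop: remaining elements that do not fill a block.
    for (; i < a.length; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

The limitation is that the identifier carries one bit of information and has to be kept in sync with the real target flags by hand.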

>
>> With compile time information about CPU it is possible to 
>> always have fast
>> generic BLAS for any target as soon as LLVM is released for 
>> this target.
>
> The SIMD instruction set is highly resistant to transforming 
> generic code into optimal vector instructions. Yes, I know 
> about auto-vectorization, and in general it is a doomed and 
> unworkable technology.
>
>   http://www.amazon.com/dp/0974364924
>
> It's gotta be done by hand to get it to fly.

Auto-vectorization is only an example (maybe a bad one). I would 
use SIMD vectors, but I need CT information about the target CPU, 
because it is impossible to build optimal BLAS kernels without 
it! My idea is an internal kernel compiler :-) Something similar 
to compile-time regex, but more complex.
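
The kernel-compiler idea can be loosely illustrated with CTFE and a string mixin. This is only a toy sketch under stated assumptions: `fpRegisters` is a hypothetical constant standing in for the compile-time target information discussed above, and `genKernel` and `axpyBlock` are illustrative names, not part of any library.

```d
/// Runs at compile time: emits the source of a fully unrolled
/// multiply-accumulate block with `unroll` statements.
string genKernel(size_t unroll)
{
    import std.conv : to;
    string code;
    foreach (i; 0 .. unroll)
    {
        immutable idx = i.to!string;
        code ~= "acc[" ~ idx ~ "] += a[" ~ idx ~ "] * b;\n";
    }
    return code;
}

// Hypothetical target description; ideally the compiler would provide it.
enum size_t fpRegisters = 16;

/// acc[i] += a[i] * b for every i, unrolled to match `fpRegisters`.
void axpyBlock(ref float[fpRegisters] acc,
               ref const(float[fpRegisters]) a, float b)
{
    mixin(genKernel(fpRegisters));
}
```

Changing `fpRegisters` (for example to 32 for Xeon Phi) regenerates the kernel body without touching the source of `axpyBlock`.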

Best regards,
Ilya