intel-intrinsics v1.0.0

Guillaume Piolat first.last at gmail.com
Thu Feb 14 14:01:53 UTC 2019


On Wednesday, 13 February 2019 at 23:26:48 UTC, Crayo List wrote:
> On Wednesday, 13 February 2019 at 19:55:05 UTC, Guillaume 
> Piolat wrote:
>> On Wednesday, 13 February 2019 at 04:57:29 UTC, Crayo List 
>> wrote:
>>> On Wednesday, 6 February 2019 at 01:05:29 UTC, Guillaume 
>>> Piolat wrote:
>>>> "intel-intrinsics" is a DUB package for people interested in 
>>>> x86 performance that want neither to write assembly, nor a 
>>>> LDC-specific snippet... and still have fastest possible code.
>>>>
>>> This is really cool and I appreciate your efforts!
>>>
>>> However (for those who are unaware) there is an alternative 
>>> way that is (arguably) better;
>>> https://ispc.github.io/index.html
>>>
>>> You can write portable vectorized code that can be trivially 
>>> invoked from D.
>>
>> ispc is another compiler in your build, and you'd write in 
>> another language, so it's not really the same thing.
>
> That's mostly what I said, except that I did not say it's the 
> same thing.
> It's an alternative way to produce vectorized code in a 
> deterministic and portable way.
> This is NOT an auto-vectorizing compiler!
>
>> I haven't used it (nor do I know anyone who do) so don't 
>> really know why it would be any better
> And that's precisely why I posted here; for those people that 
> have interest in vectorizing their code in a portable way to be 
> aware that there is another (arguably) better way.
> I highly recommend browsing through the walkthrough example;
> https://ispc.github.io/example.html
>
> For example, I have code that I can run on my Xeon Phi 7250 
> Knights Landing CPU by compiling with 
> --target=avx512knl-i32x16, then I can run the exact same code 
> with no change at all on my i7-5820k by compiling with 
> --target=avx2-i32x8. Each time I get optimal code. This is not 
> something you can easily do with intrinsics!


I don't disagree, but ispc sounds more like a host-only OpenCL to 
me, rather than a replacement for, or competitor to, 
intel-intrinsics.

Intrinsics are easy: if calling another compiler with another 
source language is trivial, then importing a DUB package and 
starting to use it within the same source file is even more 
trivial!
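For illustration, a minimal sketch of what that looks like (the 
helper name `add4` is mine; the `_mm_*` calls are the package's 
SSE API, mirroring Intel's C intrinsics):

```d
// Sketch: using intel-intrinsics from a DUB project
// (assumes a dependency like `"intel-intrinsics": "~>1.0"` in dub.json).
import inteli.xmmintrin; // SSE intrinsics, same names as Intel's C API

/// Add two 4-float arrays; each _mm_* call maps to one SSE operation.
void add4(const(float)[] a, const(float)[] b, float[] result)
{
    __m128 va = _mm_loadu_ps(a.ptr);   // unaligned load of 4 floats
    __m128 vb = _mm_loadu_ps(b.ptr);
    _mm_storeu_ps(result.ptr, _mm_add_ps(va, vb)); // vertical add + store
}

void main()
{
    float[4] a = [1, 2, 3, 4];
    float[4] b = [10, 20, 30, 40];
    float[4] r;
    add4(a[], b[], r[]);
    assert(r == [11.0f, 22.0f, 33.0f, 44.0f]);
}
```

The same source builds with LDC (where the intrinsics lower to the 
expected instructions) and with DMD (where they fall back to 
portable scalar code).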

I take issue with the claim that Single Program Multiple Data 
yields much more performance than well-written intrinsics code: 
when your compiler auto-vectorizes (or you vectorize using SIMD 
semantics) you _also_ get one instruction operating on multiple 
data. The only gain I can see for SPMD would be the use of 
non-temporal writes, since they are so hard to use effectively in 
practice.
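For reference, a hedged sketch of what a non-temporal write looks 
like with intrinsics (`fillStream` is a made-up helper; 
`_mm_stream_ps` and `_mm_sfence` come from `inteli.xmmintrin`):

```d
// Sketch: non-temporal (streaming) stores with intel-intrinsics.
// These bypass the cache, which can help when filling a large buffer
// that won't be read back soon -- but hurts if the data is hot.
import inteli.xmmintrin;

/// Fill `count` floats at `dst` with `value` using streaming stores.
/// Assumes `dst` is 16-byte aligned and `count` is a multiple of 4.
void fillStream(float* dst, size_t count, float value)
{
    __m128 v = _mm_set1_ps(value);
    foreach (i; 0 .. count / 4)
        _mm_stream_ps(dst + 4 * i, v); // store without cache pollution
    _mm_sfence(); // order streaming stores before subsequent loads
}

void main()
{
    align(16) float[8] buf;
    fillStream(buf.ptr, buf.length, 3.0f);
    assert(buf == [3.0f, 3.0f, 3.0f, 3.0f, 3.0f, 3.0f, 3.0f, 3.0f]);
}
```

Whether this is a win at all depends on buffer size and access 
pattern, which is exactly why these stores are hard to use 
effectively.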

I also take some issue with "portability": SIMD intrinsics 
optimize quite deterministically (the right instructions have been 
generated since LDC 1.0.0, even at -O0), and LLVM IR is portable 
to ARM, whereas ispc likely never will be, as its author admits: 
https://pharr.org/matt/blog/2018/04/29/ispc-retrospective.html

My interest in AVX-512 is subnormal: it can _slow things down_ on 
some x86 CPUs: 
https://gist.github.com/rygorous/32bc3ea8301dba09358fd2c64e02d774 
In general the latest instruction sets are increasingly hard to 
apply, and have lower yield.

The newer Intel instruction sets are basically a scam for the 
performance-minded. Intel-sponsored work on x265 reports 
abnormally low gains from rewriting things with AVX-512: 
https://software.intel.com/en-us/articles/accelerating-x265-with-intel-advanced-vector-extensions-512-intel-avx-512

As to compiling precisely for the host target: we are building B2C 
software here, so we don't control the host machine. Thankfully 
the ancient SIMD instruction sets yield most of the value, since a 
lot of the time memory throughput is the bottleneck anyway.

I can see ispc being more useful when you know the precise model 
of your target Intel CPU. I would also like to see it compared to 
Intel's own CPU OpenCL implementation: it seems ispc started its 
life as internal competition to it.



More information about the Digitalmars-d-announce mailing list