Any usable SIMD implementation?

Sun Apr 17 17:27:06 PDT 2016

On Tuesday, 5 April 2016 at 10:27:46 UTC, Walter Bright wrote:
> Besides, I think it's a poor design to customize the app for 
> only one SIMD type. A better idea (I've repeated this ad 
> nauseum over the years) is to have n modules, one for each 
> supported SIMD type. Compile and link all of them in, then 
> detect the SIMD type at runtime and call the corresponding 
> module. (This is how the D array ops are currently implemented.)

There are many organizations in the world that are building 
software in-house, where such software is targeted to modern CPU 
SIMD types, most typically AVX/AVX2 and crypto instructions.

In these settings -- many of them scientific compute or big data 
center operators -- they know exactly which servers and CPU 
platforms they run. They don't care about portability to older 
machines. A runtime check makes no sense for them, at least not 
below their baseline, and designing code to run on pre-AVX 
silicon would probably be a waste of their time. (AVX is not new 
anymore -- it has shipped in Intel CPUs since Sandy Bridge in 
2011.)
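
To make the trade-off concrete, here is a rough sketch in C rather than D, assuming GCC or Clang on x86. The names add_scalar, add_avx, select_add and add_baseline are made up for illustration and do not come from any library. select_add is the runtime-dispatch scheme from the quote; add_baseline is what a shop with a known AVX fleet might do instead, letting the compiler flags (e.g. -mavx) pick the path once at build time.

```c
#include <assert.h>
#include <stddef.h>

/* Portable fallback; also serves as the correctness reference. */
static void add_scalar(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}

#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define HAVE_X86_AVX_PATH 1

/* Built with AVX enabled for this one function, whatever the global flags. */
__attribute__((target("avx")))
static void add_avx(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {            /* 8 floats per 256-bit register */
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i)                      /* leftover tail elements */
        dst[i] = a[i] + b[i];
}
#endif

typedef void (*add_fn)(float *, const float *, const float *, size_t);

/* Approach 1: runtime dispatch. Every variant is linked in and one is
 * chosen after asking the CPU what it supports. */
static add_fn select_add(void)
{
#ifdef HAVE_X86_AVX_PATH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx"))
        return add_avx;
#endif
    return add_scalar;
}

/* Approach 2: fixed baseline. The choice is made at compile time, so
 * there is no runtime check and no indirect call. */
static void add_baseline(float *dst, const float *a, const float *b, size_t n)
{
    assert(n == 0 || (dst && a && b));
#if defined(HAVE_X86_AVX_PATH) && defined(__AVX__)
    add_avx(dst, a, b, n);
#else
    add_scalar(dst, a, b, n);
#endif
}
```

With the first approach a caller would typically resolve the pointer once at startup (add_fn add = select_add();) and reuse it. With the second, the binary simply assumes the baseline it was built for.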

Good examples can be found on Cloudflare's blog, especially Vlad 
Krasnov's posts. Here's one where he accelerates Golang's crypto 
libraries: 
https://blog.cloudflare.com/go-crypto-bridging-the-performance-gap/

Companies like CF probably spend millions of dollars on 
electricity, and there are some workloads where AVX-optimized 
code can yield tangible monetary savings.

Someone else talked about marking "Broadwell" and other 
generation names. As others have said, it's better to specify 
features. I wanted to chime in with a couple of additional 
examples. Intel's transactional memory instructions (TSX) are 
only available on some Broadwell parts: the original 
implementation (Haswell and early Broadwell) had a bug, so TSX is 
disabled on most of those chips. But the new Broadwell server 
chips have it, and it's a big deal for some DB workloads. 
Similarly, only some Skylake chips have the Software Guard 
Extensions (SGX), which are very powerful for creating secure 
enclaves on an untrusted host.
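
As a sketch of what "specify features" can look like in practice, the C snippet below reads the feature bits from CPUID leaf 7 (sub-leaf 0) instead of guessing from a generation name. It assumes GCC or Clang on x86 and their <cpuid.h> helper __get_cpuid_count. The struct and function names are hypothetical. A set bit only means the CPU advertises the feature: AVX2 still needs OS support for the wider registers, and SGX usually has to be enabled in firmware.

```c
#include <stdbool.h>
#include <stdint.h>

#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>
#define HAVE_X86_CPUID 1
#endif

/* Hypothetical container for the features discussed above. */
struct cpu_features {
    bool avx2;
    bool hle;   /* TSX: hardware lock elision */
    bool rtm;   /* TSX: restricted transactional memory */
    bool sgx;
};

/* Decode EBX as returned by CPUID with EAX=7, ECX=0. */
static struct cpu_features decode_leaf7_ebx(uint32_t ebx)
{
    struct cpu_features f;
    f.sgx  = (ebx >> 2) & 1u;
    f.hle  = (ebx >> 4) & 1u;
    f.avx2 = (ebx >> 5) & 1u;
    f.rtm  = (ebx >> 11) & 1u;
    return f;
}

/* Ask the running CPU; reports everything as absent on non-x86 builds
 * or when leaf 7 is not available. */
static struct cpu_features query_cpu_features(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
#ifdef HAVE_X86_CPUID
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        ebx = 0;
#endif
    (void)eax; (void)ecx; (void)edx;
    return decode_leaf7_ebx(ebx);
}
```

Two chips sold under the same generation name can return different bits here, which is exactly why the feature, not the name, is the thing to test.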

On the broader SIMD-as-first-class-citizen issue, I think it 
would be worth thinking about how to bake SIMD into the language 
instead of bolting it on. If I were designing a new language in 
2016, I would take a fresh look at how its core constructs could 
express SIMD directly. I'd think about new loop abstractions 
that make SIMD easier to exploit, and about how to nudge 
programmers away from a serial, one-element-at-a-time mindset and 
toward a SIMD/FMA way of reasoning.
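
A small illustration of the difference in mindset, again in plain C with made-up names (dot_serial, dot_lanes, LANES). This is not a proposal for a language construct, just the kind of restructuring a better loop abstraction could do for you. The serial version carries one long dependency chain through acc. The lane-wise version keeps several independent accumulators, a shape that compilers can map onto vector registers and fused multiply-adds.

```c
#include <assert.h>
#include <stddef.h>

enum { LANES = 8 };   /* assumed width: 8 floats, i.e. one 256-bit register */

/* Serial reasoning: every addition waits on the previous one. */
static float dot_serial(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += a[i] * b[i];
    return acc;
}

/* Lane-wise reasoning: LANES independent partial sums, combined at the end. */
static float dot_lanes(const float *a, const float *b, size_t n)
{
    assert(n == 0 || (a && b));
    float acc[LANES] = {0};
    size_t i = 0;

    for (; i + LANES <= n; i += LANES)
        for (size_t l = 0; l < LANES; ++l)
            acc[l] += a[i + l] * b[i + l];  /* candidate for one vector FMA */

    float sum = 0.0f;
    for (size_t l = 0; l < LANES; ++l)      /* horizontal reduction */
        sum += acc[l];
    for (; i < n; ++i)                      /* leftover tail elements */
        sum += a[i] * b[i];
    return sum;
}
```

Note that the two versions add the products in a different order, so with arbitrary floating-point inputs the results can differ in the last bits. That reordering is something a compiler will not do on its own without flags like -ffast-math, which is part of why the programmer's way of writing the loop matters.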