A little Py Vs C++

Jens Mueller jens.k.mueller at gmx.de
Fri Nov 2 15:10:40 PDT 2012


Walter Bright wrote:
> On 11/2/2012 3:50 AM, Jens Mueller wrote:
> > Okay. For me they look the same. Can you elaborate, please? Assume I
> > want to add two float vectors, which is common in both games and
> > scientific computing. The only difference is that in games their length
> > is usually 3 or 4, whereas in scientific computing it is arbitrary.
> > Why do I need intrinsics to support the game setting?
> 
> Another excellent question.
> 
> Most languages have taken the "auto-vectorization" approach of
> reverse engineering loops to turn them into high level constructs,
> and then compiling the code into special SIMD instructions.
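
For illustration (my own sketch, not from the book): the auto-vectorizer has
to start from an ordinary scalar loop and prove that it is really an
element-wise operation before it can emit SIMD instructions for it:

// A loop the auto-vectorizer must reverse engineer into "c = a + b".
void add(float[] c, const float[] a, const float[] b)
{
    foreach (i; 0 .. c.length)
        c[i] = a[i] + b[i];
}
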
> 
> How to do this is explained in detail in the (rare) book "The
> Software Vectorization Handbook" by Bik, which I fortunately was
> able to obtain a copy of.
> 
> This struck me as a terrible approach, however. It just seemed
> stupid to try to teach the compiler to reverse engineer low level
> code into high level code. A better design would be to start with
> high level code. Hence, the appearance of D vector operations.
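
To make the "high level code" concrete (my example, not Walter's): the D
vector operations he refers to are the built-in array-wise expressions:

void main()
{
    float[] a = [1.0f, 2.0f, 3.0f, 4.0f];
    float[] b = [5.0f, 6.0f, 7.0f, 8.0f];
    auto c = new float[](a.length);

    // Built-in vector (array-wise) operation: the high-level form the
    // compiler is free to lower to SIMD instructions.
    c[] = a[] + b[];
}
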
> 
> The trouble with D vector operations, however, is that they are too
> general purpose. The SIMD instructions are very quirky, and it's
> easy to unwittingly and silently cause the compiler to generate
> absolutely terribly slow code. The reasons for that are the
> alignment requirements, coupled with the SIMD instructions not being
> orthogonal - some operations work for some types and not for others,
> in a way that is unintuitive unless you're carefully reading the
> SIMD specs.
> 
> Just saying align(16) isn't good enough, as the vector ops work on
> slices and those slices aren't always aligned. So each one has to
> check alignment at runtime, which is murder on performance.
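
A sketch of the slice problem as I understand it (identifiers are mine, and
the align placement shows the intent rather than a guarantee on every
target):

align(16) float[8] data;

void main()
{
    float[] whole = data[];        // starts on a 16-byte boundary
    float[] tail  = data[1 .. 5];  // starts 4 bytes past it, so unaligned

    // The same array-wise op on an unaligned slice forces a runtime
    // alignment check or slower unaligned loads.
    whole[] += 1.0f;
    tail[]  += 1.0f;
}
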
> 
> If a particular vector op for a particular type has no SIMD support,
> then the compiler has to generate workaround code. This can also
> have terrible performance consequences.
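
A hedged example of such a gap: SSE has packed 16- and 32-bit integer
multiplies but, as far as I know, no packed 8-bit multiply, so an array-wise
multiply on bytes cannot map to a single instruction:

void main()
{
    ubyte[] a = [1, 2, 3, 4], b = [5, 6, 7, 8];
    auto c = new ubyte[](a.length);

    // No single SSE instruction for a packed 8-bit multiply, so the
    // compiler has to widen and narrow, or fall back to a scalar loop.
    c[] = a[] * b[];
}
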
> 
> So the user writes vector code, benchmarks it, finds zero
> improvement, and the reasons why will be elusive to anyone but an
> expert SIMD programmer.
> 
> (Auto-vectorizing technology has similar issues, pretty much meaning
> you won't get fast code out of it unless you've got a habit of
> examining the assembler output and tweaking as necessary.)
> 
> Enter Manu, who has a lot of experience making SIMD work for games.
> His proposal was:
> 
> 1. Have native SIMD types. This will guarantee alignment, and will
> guarantee a compile time error for SIMD types that are not supported
> by the CPU.
> 
> 2. Have the compiler issue an error for SIMD operations that are not
> supported by the CPU, rather than silently generating inefficient
> workaround code.
> 
> 3. There are all kinds of weird but highly useful SIMD instructions
> that don't have a straightforward representation in high level code,
> such as saturated arithmetic. Manu's answer was to expose these
> instructions via intrinsics, so the user can string them together and
> be sure that they will generate real SIMD instructions, while the
> compiler deals with register allocation.
> 
> This approach works, is inlineable, generates code as good as
> hand-built assembler, and is useable by regular programmers.
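
A rough sketch of what points 1 and 2 look like in code, assuming DMD's
core.simd on an x86 target (my example, not Walter's):

import core.simd;

void main()
{
    // Native SIMD type: known 16-byte alignment, and a compile-time error
    // on a target that doesn't support it.
    float4 a = [1.0f, 2.0f, 3.0f, 4.0f];
    float4 b = [5.0f, 6.0f, 7.0f, 8.0f];

    // Maps to a single SIMD add (ADDPS) rather than silently generated
    // workaround code.
    float4 c = a + b;
}
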
> 
> I won't say there aren't better approaches, but this one we know works.

I see. Thanks for clarifying.
If I want fast vector operations, I have to use core.simd; the built-in
vector operations won't fit the bill. I was under the impression that a
vector operation in D would (at some point) generate vectorized code.
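
For reference, here is roughly what the intrinsic route from point 3 looks
like, assuming DMD's core.simd __simd interface (GDC and LDC expose their
own intrinsics instead):

import core.simd;

// Saturated add of eight packed signed 16-bit integers via PADDSW, the
// kind of operation with no natural high-level syntax.
short8 saturatedAdd(short8 a, short8 b)
{
    return cast(short8) __simd(XMM.PADDSW, a, b);
}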

Jens

