core.simd woes

F i L witte2008 at gmail.com
Mon Oct 1 19:28:26 PDT 2012


Not to resurrect the dead, but I just wanted to share an article 
I came across concerning SIMD with Manu.

http://www.gamasutra.com/view/feature/4248/designing_fast_crossplatform_simd_.php

QUOTE:

1. Returning results by value

By observing the intrinsics interface a vector library must 
imitate that interface to maximize performance. Therefore, you 
must return the results by value and not by reference, as such:

     //correct
     inline Vec4 VAdd(Vec4 va, Vec4 vb)
     {
         return(_mm_add_ps(va, vb));
     };

On the other hand if the data is returned by reference the 
interface will generate code bloat. The incorrect version below:

     //incorrect (code bloat!)
     inline void VAddSlow(Vec4& vr, Vec4 va, Vec4 vb)
     {
         vr = _mm_add_ps(va, vb);
     };

The reason you must return data by value is because the quad-word 
(128-bit) fits nicely inside one SIMD register. And one of the 
key factors of a vector library is to keep the data inside these 
registers as much as possible. By doing that, you avoid 
unnecessary load and store operations from SIMD registers to 
memory or FPU registers. When combining multiple vector 
operations the "returned by value" interface allows the compiler 
to optimize these loads and stores easily by minimizing SIMD to 
FPU or memory transfers.

2. Data Declared "Purely"

Here, "pure data" is defined as data declared outside a "class" 
or "struct" by a simple "typedef" or "define". When I was 
researching various vector libraries before coding VMath, I 
observed one common pattern among all libraries I looked at 
during that time. In all cases, developers wrapped the basic 
quad-word type inside a "class" or "struct" instead of declaring 
it purely, as follows:

     class Vec4
     {
         ...
     private:
         __m128 xyzw;
     };

This type of data encapsulation is a common practice among C++ 
developers to make the architecture of the software robust. The 
data is protected and can be accessed only by the class interface 
functions. Nonetheless, this design causes code bloat by many 
different compilers in different platforms, especially if some 
sort of GCC port is being used.

An approach that is much friendlier to the compiler is to declare 
the vector data "purely", as follows:

     typedef __m128 Vec4;

ENDQUOTE;
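
(Not from the article: a minimal sketch of how the two quoted 
points might combine, assuming an x86 target with SSE and 
<xmmintrin.h> available. Vec4 and VAdd follow the quote; VSet, 
VMul, VMulAdd and VStore are names I made up for illustration.)

```cpp
// Sketch only: assumes an x86 compiler that provides SSE via <xmmintrin.h>.
#include <xmmintrin.h>

// Point 2: declare the data "purely". Vec4 is just another name for the
// register type, not a class wrapping it.
typedef __m128 Vec4;

// Hypothetical helper: build a vector from four floats in x, y, z, w order.
// _mm_set_ps takes its arguments from the highest lane to the lowest.
inline Vec4 VSet(float x, float y, float z, float w)
{
    return _mm_set_ps(w, z, y, x);
}

// Point 1: take and return by value, mirroring the intrinsics themselves.
inline Vec4 VAdd(Vec4 va, Vec4 vb)
{
    return _mm_add_ps(va, vb);
}

// Hypothetical helper, same by-value shape as VAdd.
inline Vec4 VMul(Vec4 va, Vec4 vb)
{
    return _mm_mul_ps(va, vb);
}

// Hypothetical helper: a * b + c written as chained by-value calls.
// No named temporary is forced out to memory between the two operations.
inline Vec4 VMulAdd(Vec4 a, Vec4 b, Vec4 c)
{
    return VAdd(VMul(a, b), c);
}

// Hypothetical helper: the only place in this sketch where data is
// explicitly written back to memory (unaligned store of four floats).
inline void VStore(float* out, Vec4 v)
{
    _mm_storeu_ps(out, v);
}
```

Something like VStore(out, VMulAdd(a, b, c)) then touches memory 
only at the final store. Whether a given compiler actually keeps 
everything in registers in between is something to check in the 
generated assembly, not something this sketch guarantees.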




The article is 2 years old, but it appears my earlier performance 
issue wasn't D-related at all, but an issue with C as well. I 
think in this situation, it might be best (most optimized) to 
handle SIMD "the C way" by creating an alias or union of a SIMD 
intrinsic type. D has a big advantage over C/C++ here because of 
UFCS, in that we can write external functions that look no 
different from encapsulated object methods. That, combined with 
public aliasing, means the end user only sees our pretty 
functions, and we're not sacrificing any performance.


More information about the Digitalmars-d mailing list