core.simd woes
F i L
witte2008 at gmail.com
Mon Oct 1 19:28:26 PDT 2012
Not to resurrect the dead, but I wanted to share with Manu an
article I came across concerning SIMD:
http://www.gamasutra.com/view/feature/4248/designing_fast_crossplatform_simd_.php
QUOTE:
1. Returning results by value
By observing the intrinsics interface a vector library must
imitate that interface to maximize performance. Therefore, you
must return the results by value and not by reference, as such:
//correct
inline Vec4 VAdd(Vec4 va, Vec4 vb)
{
    return(_mm_add_ps(va, vb));
}
On the other hand if the data is returned by reference the
interface will generate code bloat. The incorrect version below:
//incorrect (code bloat!)
inline void VAddSlow(Vec4& vr, Vec4 va, Vec4 vb)
{
    vr = _mm_add_ps(va, vb);
}
The reason you must return data by value is because the quad-word
(128-bit) fits nicely inside one SIMD register. And one of the
key factors of a vector library is to keep the data inside these
registers as much as possible. By doing that, you avoid
unnecessary load and store operations from SIMD registers to
memory or FPU registers. When combining multiple vector
operations the "returned by value" interface allows the compiler
to optimize these loads and stores easily by minimizing SIMD to
FPU or memory transfers.
2. Data Declared "Purely"
Here, "pure data" is defined as data declared outside a "class"
or "struct" by a simple "typedef" or "define". When I was
researching various vector libraries before coding VMath, I
observed one common pattern among all libraries I looked at
during that time. In all cases, developers wrapped the basic
quad-word type inside a "class" or "struct" instead of declaring
it purely, as follows:
class Vec4
{
    ...
private:
    __m128 xyzw;
};
This type of data encapsulation is a common practice among C++
developers to make the architecture of the software robust. The
data is protected and can be accessed only by the class interface
functions. Nonetheless, this design causes code bloat by many
different compilers in different platforms, especially if some
sort of GCC port is being used.
An approach that is much friendlier to the compiler is to declare
the vector data "purely", as follows:
typedef __m128 Vec4;
ENDQUOTE;
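To make the two quoted points concrete, here is a minimal sketch that
combines them: Vec4 is declared "purely" as a typedef, and every
operation returns its result by value so that chained calls can stay
in registers. VAdd and the typedef come from the article; VMul,
VMulAdd, VStore and MulAddDemo are hypothetical names added for
illustration, and the code assumes an x86/x86-64 target with SSE.
Whether the by-reference form actually produces worse code depends on
the compiler and on inlining, so treat this as a shape to benchmark,
not a guarantee.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics; assumes an x86/x86-64 target

// "Pure" declaration: Vec4 is just another name for the register type.
typedef __m128 Vec4;

// Return by value, as the article recommends.
inline Vec4 VAdd(Vec4 va, Vec4 vb) { return _mm_add_ps(va, vb); }

// Hypothetical companion to VAdd (not from the article).
inline Vec4 VMul(Vec4 va, Vec4 vb) { return _mm_mul_ps(va, vb); }

// Hypothetical helper: a * b + c written as one chained expression.
// No intermediate result is given an address, so nothing forces it
// out to memory.
inline Vec4 VMulAdd(Vec4 a, Vec4 b, Vec4 c) { return VAdd(VMul(a, b), c); }

// Hypothetical helper: copy the four lanes to memory for inspection.
inline void VStore(float out[4], Vec4 v) { _mm_storeu_ps(out, v); }

// Small self-check of the chained call.
inline bool MulAddDemo()
{
    // _mm_set_ps takes lanes from high to low, so lane 0 is 1.0f.
    Vec4 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    Vec4 b = _mm_set1_ps(2.0f);
    Vec4 c = _mm_set1_ps(0.5f);

    float r[4];
    VStore(r, VMulAdd(a, b, c));

    assert(r[0] == 2.5f && r[1] == 4.5f);
    assert(r[2] == 6.5f && r[3] == 8.5f);
    return true;
}
```

The only store in this sketch is the explicit VStore at the end, which
is the pattern the article is arguing for.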
The article is 2 years old, but it appears my earlier performance
issue wasn't D-related at all; it is an issue in C as well. I
think in this situation it might be best (most optimized) to
handle SIMD "the C way", by creating an alias or union of a SIMD
intrinsic type. D has a big advantage over C/C++ here because of
UFCS: we can write external functions that look no different
from encapsulated object methods. Combined with public aliasing,
that means the end user only sees our pretty functions, and we
don't sacrifice any performance.
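For comparison, here is a rough sketch of what the "alias plus
external functions" layout looks like on the C++ side. The vmath
namespace and the names make, add, scale, dot and DotDemo are all
hypothetical, and the code again assumes an x86/x86-64 target with
SSE. C++ has no UFCS, so the calls stay in function form; in D the
same free function dot(a, b) could also be written a.dot(b).

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics; assumes an x86/x86-64 target

namespace vmath {  // hypothetical namespace, for illustration only

// Public alias: users see Vec4, the compiler sees a plain __m128.
typedef __m128 Vec4;

// External (free) functions play the role of methods.
inline Vec4 make(float x, float y, float z, float w)
{
    return _mm_set_ps(w, z, y, x);  // arguments run from high lane to low
}

inline Vec4 add(Vec4 a, Vec4 b) { return _mm_add_ps(a, b); }

inline Vec4 scale(Vec4 v, float s) { return _mm_mul_ps(v, _mm_set1_ps(s)); }

// Dot product using SSE1 only: multiply, then a horizontal sum.
inline float dot(Vec4 a, Vec4 b)
{
    Vec4 m = _mm_mul_ps(a, b);                    // (x, y, z, w) products
    Vec4 s = _mm_add_ps(m, _mm_movehl_ps(m, m));  // lanes 0,1 = x+z, y+w
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 1, 1, 1)));
    return _mm_cvtss_f32(s);                      // lane 0 = x+y+z+w
}

// Small self-check: (1,2,3,4) . (5,6,7,8) = 5 + 12 + 21 + 32 = 70.
inline bool DotDemo()
{
    Vec4 a = make(1.0f, 2.0f, 3.0f, 4.0f);
    Vec4 b = make(5.0f, 6.0f, 7.0f, 8.0f);
    assert(dot(a, b) == 70.0f);
    assert(dot(scale(a, 2.0f), b) == 140.0f);
    return true;
}

}  // namespace vmath
```

Nothing here wraps the register type, so there is no class layer for
the optimizer to see through; the trade-off is that the type offers
no encapsulation of its own.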