Align a variable on the stack.
TheFlyingFiddle via Digitalmars-d-learn
digitalmars-d-learn at puremagic.com
Wed Nov 4 19:52:46 PST 2015
On Wednesday, 4 November 2015 at 01:14:31 UTC, Nicholas Wilson
wrote:
> Note that there are two different alignments:
> to control padding between instances on the stack
> (arrays)
> to control padding between members of a struct
>
> align(64) //arrays
> struct foo
> {
> align(16) short baz; //between members
> align (1) float quux;
> }
>
> your 2.5x speedup is due to aligned vs. unaligned loads and
> stores which for SIMD type stuff has a really big effect.
> Basically misaligned stuff is really slow. IIRC there was a
> (blog/paper?) of someone on a uC spending a vast amount of time
> in ONE misaligned integer assignment causing traps and getting
> the kernel involved. Not quite as bad on x86 but still worth
> doing.
>
> As to a less hacky solution I'm not sure there is one.
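For reference, the two kinds of alignment in the quoted snippet
can be inspected with the .alignof, .offsetof and .sizeof
properties. This is only a minimal sketch, assuming a reasonably
current D compiler; the struct is the one from the quote, renamed
Foo for illustration.

```d
import std.stdio : writeln;

// align(64) in front of the declaration sets the alignment of
// the struct type itself (and so the stride of Foo[] arrays).
align(64) struct Foo
{
    align(16) short baz;  // alignment of this member inside Foo
    align(1) float quux;  // packed directly after baz
}

void main()
{
    writeln("Foo.alignof       = ", Foo.alignof);
    writeln("Foo.sizeof        = ", Foo.sizeof);
    writeln("Foo.baz.offsetof  = ", Foo.baz.offsetof);
    writeln("Foo.quux.offsetof = ", Foo.quux.offsetof);
}
```

Note that this only reports what the type asks for. Whether a
particular stack variable actually ends up on a 64-byte boundary
is up to the compiler and has to be checked separately.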
Thanks for the reply. I did some more checking around and I found
that it was not really an alignment problem but was caused by
using the default init value of my type.
My starting type:
align(64) struct Phys
{
    float x, y, z, w;
    //More stuff.
} //Was 64 bytes in size at the time.
The above worked fine; it was fast and all. But after a while I
wanted the data in a different format, so I started decoding
positions and other variables into separate arrays.
Something like this:
align(16) struct Pos { float x, y, z, w; }
Contrary to my limited knowledge of how CPUs work, this was much
slower. Doing the same thing lots of times, touching less memory
with fewer branches, should in theory at least be faster, right?
So after I ruled out bottlenecks in the parser I assumed there
were some alignment problems, and I did my Aligner hack. This
caused the code to run faster, so I assumed alignment was the
cause... Naive! (There was a typo in the code I submitted to
begin with: I used a = Align!(T).init and not a.value = T.init.)
The slowdown was actually caused by the line t = T.init, no
matter whether t was aligned or not. I solved the problem by
changing the struct to look like this:
align(16) struct Pos
{
    float x = float.nan;
    float y = float.nan;
    float z = float.nan;
    float w = float.nan;
}
Basically T.init gets explicit values. But... this should be the
same Pos.init as the default Pos.init, so I really fail to
understand how this could fix the problem. My guess is that the
compiler generates slightly different code if I do it this way,
and that this slightly different code avoids some bottleneck in
the CPU. But when I took a look at the assembly of the function I
could not find any difference in the generated code...
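One thing that can be checked directly is the assumption that
both versions of Pos.init are bit-for-bit the same. The sketch
below dumps the raw 32-bit words of each init value. The names
PosDefault, PosExplicit, rawBits and dump are made up for this
example, and whether the two patterns match may depend on the
compiler and its version, so no result is claimed here.

```d
import std.stdio : writef, writeln;

// Hypothetical stand-ins for the two versions of Pos.
align(16) struct PosDefault { float x, y, z, w; }

align(16) struct PosExplicit
{
    float x = float.nan;
    float y = float.nan;
    float z = float.nan;
    float w = float.nan;
}

// Reinterprets the 16 bytes of a value as four 32-bit words.
uint[4] rawBits(T)(ref const T value)
    if (T.sizeof == 16)
{
    return *cast(const(uint[4])*) &value;
}

// Prints a label followed by the words in hexadecimal.
void dump(string label, uint[4] words)
{
    writef("%-10s", label);
    foreach (w; words)
        writef(" %08x", w);
    writeln();
}

void main()
{
    auto a = PosDefault.init;
    auto b = PosExplicit.init;
    dump("default:", rawBits(a));
    dump("explicit:", rawBits(b));
    writeln("identical: ", rawBits(a) == rawBits(b));
}
```

If the two patterns differ, that would point at the value being
copied rather than at the copy instructions themselves.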
I don't really know where to go from here to figure out the
underlying cause. Does anyone have any suggestions?
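To narrow it down, the t = T.init assignment could be timed in
isolation for both struct variants. The following is a rough
sketch, not the code from my project: PosDefault, PosExplicit and
timeInitAssign are hypothetical names, it assumes a compiler
recent enough to ship std.datetime.stopwatch, and an optimizing
build may remove or merge the loops, so the numbers need to be
read with care.

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;

// Hypothetical stand-ins for the two versions of Pos.
align(16) struct PosDefault { float x, y, z, w; }

align(16) struct PosExplicit
{
    float x = float.nan;
    float y = float.nan;
    float z = float.nan;
    float w = float.nan;
}

// Assigns T.init to every element `rounds` times and returns the
// elapsed time in microseconds.
long timeInitAssign(T)(T[] items, size_t rounds)
{
    auto sw = StopWatch(AutoStart.yes);
    foreach (_; 0 .. rounds)
        foreach (ref item; items)
            item = T.init;
    return sw.peek.total!"usecs";
}

void main()
{
    enum count = 1 << 16;  // elements per array (arbitrary)
    enum rounds = 200;     // repetitions (arbitrary)

    auto a = new PosDefault[count];
    auto b = new PosExplicit[count];

    writefln("default init:  %s us", timeInitAssign(a, rounds));
    writefln("explicit init: %s us", timeInitAssign(b, rounds));
}
```

Comparing the two timings, and the disassembly of just
timeInitAssign for each type, keeps the parser and the rest of
the program out of the picture.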