Align a variable on the stack.

Wed Nov 4 19:52:46 PST 2015

On Wednesday, 4 November 2015 at 01:14:31 UTC, Nicholas Wilson 
wrote:
> Note that there are two different alignments:
>          to control padding between instances on the stack 
> (arrays)
>          to control padding between members of a struct
>
> align(64) //arrays
> struct foo
> {
>       align(16) short baz; //between members
>       align (1) float quux;
> }
>
> your 2.5x speedup is due to aligned vs. unaligned loads and 
> stores which for SIMD type stuff has a really big effect. 
> Basically misaligned stuff is really slow. IIRC there was a 
> (blog/paper?) of someone on a uC spending a vast amount of time 
> in ONE misaligned integer assignment causing traps and getting 
> the kernel involved. Not quite as bad on x86 but still worth 
> doing.
>
> As to a less hacky solution I'm not sure there is one.

Thanks for the reply. I did some more checking around and I found 
that it was not really an alignment problem but was caused by 
using the default init value of my type.

My starting type:
align(64) struct Phys
{
    float x, y, z, w;
    //More stuff.
} //Was 64 bytes in size at the time.

The above worked fine, it was fast and all. But after a while I 
wanted the data in a different format. So I started decoding 
positions and other variables into separate arrays.

Something like this:
align(16) struct Pos { float x, y, z, w; }
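
To see what the compiler actually does with that align(16), a 
quick check along these lines can help. This is only a sketch: 
isAligned is a helper name made up for this example, and whether 
the stack variable really ends up 16-byte aligned depends on the 
compiler and target, so no particular output is implied.

```d
import std.stdio : writefln;

align(16) struct Pos { float x, y, z, w; }

/// Hypothetical helper: true if address p is a multiple of n bytes.
bool isAligned(const(void)* p, size_t n)
{
    return (cast(size_t) p) % n == 0;
}

void main()
{
    Pos onStack;                 // a local, i.e. "a variable on the stack"
    auto onHeap = new Pos[](4);  // GC-allocated array of Pos

    writefln("Pos.sizeof = %s, Pos.alignof = %s", Pos.sizeof, Pos.alignof);
    writefln("stack variable 16-byte aligned: %s", isAligned(&onStack, 16));
    writefln("heap array 16-byte aligned:     %s", isAligned(onHeap.ptr, 16));
}
```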

Contrary to my limited knowledge of how CPUs work, this was much 
slower. Doing the same thing lots of times, touching less memory 
with fewer branches, should in theory at least be faster, right? So 
after I ruled out bottlenecks in the parser I assumed there was 
some alignment problem, so I did my Aligner hack. This caused the 
code to run faster, so I assumed alignment was the cause... Naive! 
(There was a typo in the code I submitted to begin with: I used 
a = Align!(T).init and not a.value = T.init.)

The slowdown was actually caused by the line t = T.init, no 
matter whether it was aligned or not. I solved the problem by 
changing the struct to look like this:
align(16) struct Pos
{
     float x = float.nan;
     float y = float.nan;
     float z = float.nan;
     float w = float.nan;
}
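
The comparison between the two variants boils down to something 
like the following. It is a rough sketch, not the original code: 
PosDefault, PosExplicit and timeInit are names invented here, the 
array size and round count are arbitrary, and an optimizing build 
may drop or merge the stores, so the timings are only a hint.

```d
import core.time : Duration;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;

// Fields rely on the implicit default initializer.
align(16) struct PosDefault { float x, y, z, w; }

// Same layout, but every field has an explicit initializer.
align(16) struct PosExplicit
{
    float x = float.nan;
    float y = float.nan;
    float z = float.nan;
    float w = float.nan;
}

/// Times `t = T.init` over every element of data, repeated `rounds` times.
Duration timeInit(T)(T[] data, size_t rounds)
{
    auto sw = StopWatch(AutoStart.yes);
    foreach (_; 0 .. rounds)
        foreach (ref t; data)
            t = T.init;
    return sw.peek();
}

void main()
{
    enum count = 1 << 16;  // arbitrary element count
    enum rounds = 1_000;   // arbitrary repeat count

    auto a = new PosDefault[](count);
    auto b = new PosExplicit[](count);

    writefln("default init:  %s", timeInit(a, rounds));
    writefln("explicit init: %s", timeInit(b, rounds));
}
```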

Basically T.init gets explicit values. But... this should be the 
same Pos.init as the default Pos.init, so I really fail to 
understand how this could fix the problem. I guessed that the 
compiler generates slightly different code if I do it this way, 
and that this slightly different code avoids some bottleneck in 
the CPU. But when I took a look at the assembly of the function I 
could not find any difference in the generated code...
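
The assumption that both .init values are the same can at least 
be tested directly by comparing their raw bytes. A minimal sketch 
(rawBytes, sameInitBits and the two struct names are made up for 
this example; I am not claiming what it prints on any given 
compiler version):

```d
import std.stdio : writefln;

align(16) struct PosDefault { float x, y, z, w; }

align(16) struct PosExplicit
{
    float x = float.nan;
    float y = float.nan;
    float z = float.nan;
    float w = float.nan;
}

/// Views the storage of value as a slice of bytes.
const(ubyte)[] rawBytes(T)(ref const T value)
{
    return (cast(const(ubyte)*) &value)[0 .. T.sizeof];
}

/// True if A.init and B.init have the same size and bit pattern.
bool sameInitBits(A, B)()
{
    A a = A.init;
    B b = B.init;
    return rawBytes(a) == rawBytes(b);
}

void main()
{
    auto a = PosDefault.init;
    auto b = PosExplicit.init;

    // Dump both bit patterns so any difference is visible.
    writefln("default:  %(%02x %)", rawBytes(a));
    writefln("explicit: %(%02x %)", rawBytes(b));
    writefln("identical bits: %s", sameInitBits!(PosDefault, PosExplicit));
}
```

If the bytes differ, NaN payloads would be the first thing to 
look at; if they are identical, the difference has to come from 
how the initializer is stored or copied rather than from its 
value.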

I don't really know where to go from here to figure out the 
underlying cause. Does anyone have any suggestions?