Does dmd have SSE intrinsics?

Tue Sep 22 08:00:32 PDT 2009

Don wrote:
> bearophile wrote:
>> Robert Jacques:
>>
>>> Yes, but the unaligned version is slower, even for aligned data.
>>
>> This is true today, but in future it may become a little less true, 
>> thanks to improvements in the CPUs.
> 
> The problem is that the difference today is so extreme. On Core2:
>  movaps [mem128], xmm0; // aligned,   1 micro-op
>  movups [mem128], xmm0; // unaligned, 9 micro-ops, even on aligned data!
> In practice it's about an 8X speed difference!
> 
> On AMD K8, it's only 2 vs 5 ops, and on K10 it's 2 vs 3 ops.
> On i7, movups on aligned data is the same speed as movaps. It's still 
> slower if it's an unaligned access.
> 
> It all depends on how important you think performance on Core2 and 
> earlier Intel processors is.

I wasn't aware of that. Here I was wondering why my SSE code was 
slower than the FPU in certain places on my Core2 Quad. I now recall 
using a lot of movups instructions. Thanks for the tip.
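
For anyone following along, here is a minimal sketch of the idea, written
in C with the <xmmintrin.h> intrinsics rather than D. It checks the pointer
once and only falls back to the unaligned load/store when it has to. The
function names (is_aligned16, scale_aligned, scale_unaligned, scale_floats)
are made up for illustration. _mm_load_ps/_mm_store_ps typically compile to
movaps and _mm_loadu_ps/_mm_storeu_ps to movups, but the exact instruction
chosen is up to the compiler and its target flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_loadu_ps, _mm_store_ps, ... */

/* Hypothetical helper: nonzero if p sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15) == 0;
}

/* Aligned path: loads/stores that require 16-byte alignment (movaps).
   Returns the number of elements processed (a multiple of 4). */
static size_t scale_aligned(float *data, size_t n, __m128 f)
{
    size_t i = 0;
    assert(is_aligned16(data));  /* an unaligned pointer here would fault */
    for (; i + 4 <= n; i += 4)
        _mm_store_ps(data + i, _mm_mul_ps(_mm_load_ps(data + i), f));
    return i;
}

/* Unaligned path: works for any address (movups), but this is the
   form that is slow on Core2 even when the data happens to be aligned. */
static size_t scale_unaligned(float *data, size_t n, __m128 f)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(data + i, _mm_mul_ps(_mm_loadu_ps(data + i), f));
    return i;
}

/* Multiply n floats in place, choosing the path once up front. */
static void scale_floats(float *data, size_t n, float factor)
{
    const __m128 f = _mm_set1_ps(factor);
    size_t i = is_aligned16(data) ? scale_aligned(data, n, f)
                                  : scale_unaligned(data, n, f);
    for (; i < n; i++)  /* scalar tail for the last 0..3 elements */
        data[i] *= factor;
}
```

The point of dispatching once is that the alignment test costs a single
branch per call, while the loop body stays on the cheap instruction
whenever the buffer allows it.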