SIMD implementation of dot-product. Benchmarks
Ilya Yaroshenko
ilyayaroshenko at gmail.com
Sat Aug 17 22:24:43 PDT 2013
On Sunday, 18 August 2013 at 05:07:12 UTC, Manu wrote:
> On 18 August 2013 14:39, Ilya Yaroshenko
> <ilyayaroshenko at gmail.com> wrote:
>
>> On Sunday, 18 August 2013 at 01:53:53 UTC, Manu wrote:
>>
>>> It doesn't look like you account for alignment.
>>> This is basically not-portable (I doubt unaligned loads in this
>>> context are faster than performing scalar operations), and
>>> possibly inefficient on x86 too.
>>>
>>
>> dotProduct uses unaligned loads (__builtin_ia32_loadups256,
>> __builtin_ia32_loadupd256) and it is up to 21 times faster than
>> the trivial scalar version.
>>
>> Why are unaligned loads non-portable and inefficient?
>
>
> x86 is the only arch that can perform an unaligned load. And even
> on x86 (many implementations) it's not very efficient.
:(
>
>
>>> To make it account for potentially random alignment will be
>>> awkward, but it might be possible to do efficiently.
>>>
>>
>> Did you mean to use unaligned loads, or to prepare the data for
>> aligned loads at the beginning of the function?
>>
>
> I mean to only use aligned loads, in whatever way that happens
> to work out.
> The hard case is when the 2 arrays have different start offsets.
>
> Otherwise you need to wrap your code in a version(x86) block.
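
Something like this, I guess? A minimal sketch of the guard (dotSimd
and dotScalar are placeholder names here, not the real functions;
D's predefined identifiers are spelled X86 and X86_64):

double dot(const(float)[] a, const(float)[] b)
{
    version (X86_64)
        return dotSimd(a, b);    // unaligned-load SIMD path
    else version (X86)
        return dotSimd(a, b);
    else
        return dotScalar(a, b);  // portable scalar fallback
}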
Thanks!
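
I'll also try restructuring the loop around aligned loads, roughly
along these lines. This is only a rough sketch (not the current
dotProduct code), assuming core.simd's float4 and that both arrays
end up 16-byte aligned after the scalar prologue; the
different-start-offset case just falls through to the scalar tail:

import core.simd;

double dot(const(float)[] a, const(float)[] b)
{
    assert(a.length == b.length);
    double sum = 0;
    size_t i = 0;

    // Scalar prologue: advance until a.ptr + i is 16-byte aligned.
    while (i < a.length && (cast(size_t) (a.ptr + i) & 15) != 0)
    {
        sum += a[i] * b[i];
        ++i;
    }

    // Vector loop with aligned loads, taken only if b is now aligned too.
    if ((cast(size_t) (b.ptr + i) & 15) == 0)
    {
        float4 acc = 0;
        for (; i + 4 <= a.length; i += 4)
            acc += *cast(const(float4)*) (a.ptr + i)
                 * *cast(const(float4)*) (b.ptr + i);
        foreach (x; acc.array)
            sum += x;
    }

    // Scalar tail; the mismatched-alignment case also ends up here.
    for (; i < a.length; ++i)
        sum += a[i] * b[i];

    return sum;
}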