SIMD implementation of dot-product. Benchmarks
Ilya Yaroshenko
ilyayaroshenko at gmail.com
Sat Aug 17 22:24:43 PDT 2013
On Sunday, 18 August 2013 at 05:07:12 UTC, Manu wrote:
> On 18 August 2013 14:39, Ilya Yaroshenko
> <ilyayaroshenko at gmail.com> wrote:
>
>> On Sunday, 18 August 2013 at 01:53:53 UTC, Manu wrote:
>>
>>> It doesn't look like you account for alignment.
>>> This is basically not-portable (I doubt unaligned loads in this
>>> context are faster than performing scalar operations), and
>>> possibly inefficient on x86 too.
>>>
>>
>> dotProduct uses unaligned loads (__builtin_ia32_loadups256,
>> __builtin_ia32_loadupd256) and it is up to 21 times faster than
>> the trivial scalar version.
>>
>> Why are unaligned loads non-portable and inefficient?
>
>
> x86 is the only arch that can perform an unaligned load. And even
> on x86 (many implementations) it's not very efficient.
:(
>
>
>>> To make it account for potentially random alignment will be
>>> awkward, but it might be possible to do efficiently.
>>>
>>
>> Did you mean to use unaligned loads, or to prepare the data for
>> aligned loads at the beginning of the function?
>>
>
> I mean to only use aligned loads, in whatever way that happens
> to work out.
> The hard case is when the 2 arrays have different start offsets.
>
> Otherwise you need to wrap your code in a version(x86) block.
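
Something like this, I guess? A minimal sketch of the guard (dotSimd
and dotScalar are placeholder names here, not the real functions;
D's predefined identifiers are spelled X86 and X86_64):

double dot(const(float)[] a, const(float)[] b)
{
    version (X86_64)
        return dotSimd(a, b);    // unaligned-load SIMD path
    else version (X86)
        return dotSimd(a, b);
    else
        return dotScalar(a, b);  // portable scalar fallback
}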
Thanks!
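
I'll also try restructuring the loop around aligned loads, roughly
along these lines. This is only a rough sketch (not the current
dotProduct code), assuming core.simd's float4 and that both arrays
end up 16-byte aligned after the scalar prologue; the
different-start-offset case just falls through to the scalar tail:

import core.simd;

double dot(const(float)[] a, const(float)[] b)
{
    assert(a.length == b.length);
    double sum = 0;
    size_t i = 0;

    // Scalar prologue: advance until a.ptr + i is 16-byte aligned.
    while (i < a.length && (cast(size_t) (a.ptr + i) & 15) != 0)
    {
        sum += a[i] * b[i];
        ++i;
    }

    // Vector loop with aligned loads, taken only if b is now aligned too.
    if ((cast(size_t) (b.ptr + i) & 15) == 0)
    {
        float4 acc = 0;
        for (; i + 4 <= a.length; i += 4)
            acc += *cast(const(float4)*) (a.ptr + i)
                 * *cast(const(float4)*) (b.ptr + i);
        foreach (x; acc.array)
            sum += x;
    }

    // Scalar tail; the mismatched-alignment case also ends up here.
    for (; i < a.length; ++i)
        sum += a[i] * b[i];

    return sum;
}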