SIMD implementation of dot-product. Benchmarks

Sat Aug 17 21:55:24 PDT 2013

On Saturday, 17 August 2013 at 19:38:52 UTC, John Colvin wrote:
> On Saturday, 17 August 2013 at 19:24:52 UTC, Ilya Yaroshenko 
> wrote:
>>> BTW: -march=native automatically implies -mtune=native
>>
>> Thanks, I'll remove mtune)
>
> It would be really interesting if you could try writing the 
> same code in C, both a scalar version and a version using gcc's 
> vector intrinsics, to allow us to compare performance and 
> identify areas for D to improve.

I am lazy )

I have looked at the generated assembler code:

float, scalar (main loop):
.L191:
	vmovss	xmm1, DWORD PTR [rsi+rax*4]
	vfmadd231ss	xmm0, xmm1, DWORD PTR [rcx+rax*4]
	add	rax, 1
	cmp	rax, rdi
	jne	.L191

float, vector (main loop):
.L2448:
	vmovups	ymm5, YMMWORD PTR [rax]
	sub	rax, -128
	sub	r11, -128
	vmovups	ymm4, YMMWORD PTR [r11-128]
	vmovups	ymm6, YMMWORD PTR [rax-96]
	vmovups	ymm7, YMMWORD PTR [r11-96]
	vfmadd231ps	ymm3, ymm5, ymm4
	vmovups	ymm8, YMMWORD PTR [rax-64]
	vmovups	ymm9, YMMWORD PTR [r11-64]
	vfmadd231ps	ymm0, ymm6, ymm7
	vmovups	ymm10, YMMWORD PTR [rax-32]
	vmovups	ymm11, YMMWORD PTR [r11-32]
	cmp	rdi, rax
	vfmadd231ps	ymm2, ymm8, ymm9
	vfmadd231ps	ymm1, ymm10, ymm11
	ja	.L2448

float, vector (full):
	https://gist.github.com/9il/6258443

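It is pretty well optimized: the scalar loop is a single FMA per element, and the vector main loop is unrolled 4x with four independent ymm accumulators, so each iteration handles 32 floats from each array with four packed FMAs )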
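
For comparison, here is a rough, unbenchmarked C sketch of the same loop structure as the vector main loop above, written with gcc's generic vector extensions: four independent 8-float accumulators, each fed by one multiply-add per iteration, then a horizontal sum and a scalar tail. The names dot_scalar, dot_vector, madd8 and v8sf are made up for illustration; this is not the code from the gist, and whether gcc emits the same vmovups/vfmadd231ps sequence depends on the compiler version and flags (something like -O3 -march=native on an FMA-capable CPU is assumed).

```c
#include <stddef.h>
#include <string.h>

/* 8 packed floats (256 bits); vector_size is a gcc/clang extension. */
typedef float v8sf __attribute__((vector_size(32)));

/* Scalar reference: one multiply-add per element. */
float dot_scalar(const float *a, const float *b, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += a[i] * b[i];
    return s;
}

/* acc += a[0..7] * b[0..7]; memcpy expresses an unaligned load. */
static void madd8(v8sf *acc, const float *a, const float *b)
{
    v8sf x, y;
    memcpy(&x, a, sizeof x);
    memcpy(&y, b, sizeof y);
    *acc += x * y;
}

/* Vector version: 4x unrolled, 32 floats per iteration. */
float dot_vector(const float *a, const float *b, size_t n)
{
    v8sf s0 = {0}, s1 = {0}, s2 = {0}, s3 = {0};
    size_t i = 0;

    /* Four independent accumulators keep the FMA chains separate. */
    for (; i + 32 <= n; i += 32) {
        madd8(&s0, a + i,      b + i);
        madd8(&s1, a + i + 8,  b + i + 8);
        madd8(&s2, a + i + 16, b + i + 16);
        madd8(&s3, a + i + 24, b + i + 24);
    }

    /* Combine the accumulators, then sum the 8 lanes. */
    v8sf s = (s0 + s1) + (s2 + s3);
    float r = 0.0f;
    for (int k = 0; k < 8; ++k)
        r += s[k];

    /* Scalar tail for the remaining n % 32 elements. */
    for (; i < n; ++i)
        r += a[i] * b[i];
    return r;
}
```

Note that the vector version sums in a different order than the scalar one, so for general float input the two results can differ in the last bits.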

____
Best regards

Ilya