Encouraging preliminary results implementing memcpy in D
Mike Franklin
slavo5150 at yahoo.com
Mon Jun 18 02:31:25 UTC 2018
On Sunday, 17 June 2018 at 17:00:00 UTC, David Nadlinger wrote:
> On Wednesday, 13 June 2018 at 06:46:43 UTC, Mike Franklin wrote:
>> https://github.com/JinShil/memcpyD
>>
>> […]
>>
>> Feedback, advise, and pull requests to improve the
>> implementation are most welcome.
>
> The memcpyD implementation is buggy; it assumes that all
> arguments are aligned to their size. This isn't necessarily
> true. For example, `ubyte[1024].alignof == 1`, and struct
> alignment can also be set explicitly using align(N).
Yes, I'm already aware of that. My plan is to create optimized
implementations for aligned data, and then handle unaligned data
as compositions of the various aligned implementations. For
example, a 3-byte copy would be a short copy plus a byte copy.
That may not be appropriate for all cases; I'll have to measure
and adapt.
> On x86, you can get away with this in a lot of cases even
> though it's undefined behaviour [1], but this is not
> necessarily the case for SSE/AVX instructions. In fact, that's
> probably a pretty good guess as to where those weird crashes
> you mentioned come from.
Thanks! I think you're right.
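For anyone following along, a minimal illustration of the failure
mode (my own example, not code from memcpyD):

    import core.simd : void16;

    void main()
    {
        ubyte[32] buf;
        // Offsetting by 1 makes these misaligned unless buf.ptr
        // happens to end on a 15-mod-16 boundary.
        auto src = cast(const void16*)(buf.ptr + 1);
        auto dst = cast(void16*)(buf.ptr + 17);
        // DMD compiles this to movdqa (see Exhibit A below), which
        // raises a general-protection fault on misaligned operands.
        *dst = *src;
    }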
> For loading into vector registers, you can use
> core.simd.loadUnaligned instead (ldc.simd.loadUnaligned for LDC
> – LDC's druntime has not been updated yet after {load,
> store}Unaligned were added upstream as well).
Unfortunately the code gen is quite a bit worse:
Exhibit A:
https://run.dlang.io/is/jIuHRG
*(cast(void16*)(&s2)) = *(cast(const void16*)(&s1));
_Dmain:
push RBP
mov RBP,RSP
sub RSP,020h
lea RAX,-020h[RBP]
xor ECX,ECX
mov [RAX],RCX
mov 8[RAX],RCX
lea RDX,-010h[RBP]
mov [RDX],RCX
mov 8[RDX],RCX
movdqa XMM0,-020h[RBP]
movdqa -010h[RBP],XMM0
xor EAX,EAX
leave
ret
add [RAX],AL
.text._Dmain ends
Exhibit B:
https://run.dlang.io/is/PLRfhW
storeUnaligned(cast(void16*)(&s2), loadUnaligned(cast(const
void16*)(&s1)));
_Dmain:
push RBP
mov RBP,RSP
sub RSP,050h
lea RAX,-050h[RBP]
xor ECX,ECX
mov [RAX],RCX
mov 8[RAX],RCX
lea RDX,-040h[RBP]
mov [RDX],RCX
mov 8[RDX],RCX
mov -030h[RBP],RDX
mov -010h[RBP],RAX
movdqu XMM0,[RAX]
movdqa -020h[RBP],XMM0
movdqa XMM1,-020h[RBP]
movdqu [RDX],XMM1
xor EAX,EAX
leave
ret
add [RAX],AL
.text._Dmain ends
If the code gen were better, that would definitely be the way to
go: have the unaligned and aligned paths share the same
implementation. Maybe I can fix the DMD code gen, or implement a
`copyUnaligned` intrinsic.
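Purely as a sketch, the fallback for such an intrinsic could be as
simple as this (hypothetical; no such symbol exists in druntime
today):

    import core.simd : void16, loadUnaligned, storeUnaligned;

    // Hypothetical copyUnaligned: written this way it lowers to the
    // loadUnaligned/storeUnaligned pair from Exhibit B; as a proper
    // intrinsic it could compile down to a bare movdqu pair.
    void copyUnaligned(void16* dst, const void16* src)
    {
        storeUnaligned(dst, loadUnaligned(src));
    }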
Also, there don't seem to be any equivalent 32-byte
implementations in `core.simd`. Is that just because no one's
bothered to implement them yet? And with AVX-512, we should
probably have 64-byte implementations as well.
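Presumably they'd just mirror the 16-byte pattern, e.g. (assuming
void32 overloads of loadUnaligned/storeUnaligned were added; none
exist at the time of writing):

    // Hypothetical 32-byte analogue of Exhibit B, using the
    // core.simd.void32 vector type (requires AVX).
    storeUnaligned(cast(void32*)(&d2), loadUnaligned(cast(const void32*)(&d1)));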
Mike