stride in slices

Tue Jun 5 20:07:06 UTC 2018

On Tuesday, 5 June 2018 at 19:05:27 UTC, DigitalDesigns wrote:
> For loops HAVE a direct cpu semantic! Do you doubt this?

...

Right. If you're gonna keep running your mouth off, how about 
looking at some disassembly, then?

for(auto i=0; i<a.length; i+=strideAmount)

Using ldc -O4 -release for x86_64 processors, the initialiser 
translates to:

mov byte ptr [rbp + rcx], 0

The comparison translates to:

cmp r13, rcx
ja .LBB0_2

And the increment and store translate to:

mov byte ptr [rbp + rcx], 0
movsxd rcx, eax
add eax, 3

So, it uses four of the most basic instructions you can think 
of: mov, cmp, j<cond>, add.
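
For reference, the whole loop being disassembled presumably looks something like the sketch below. Only the loop header appears above; the function name zeroStrided, the ubyte[] element type, and the loop body are assumptions inferred from the "mov byte ptr [...], 0" store, and the stride of 3 is taken from "add eax, 3".

```d
// Assumption: the post only shows the loop header. The stride value
// comes from `add eax, 3` in the listing.
enum strideAmount = 3;

// Hypothetical helper: zeroes every strideAmount-th element of `a`.
// The byte store in the body is inferred from `mov byte ptr [...], 0`.
void zeroStrided(ubyte[] a)
{
    // `auto i = 0` makes i an int, which matches the sign-extending
    // movsxd seen in the disassembly.
    for (auto i = 0; i < a.length; i += strideAmount)
        a[i] = 0;
}
```

Called on a ten-element slice, this touches indices 0, 3, 6 and 9 and leaves the rest alone.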

Now, you might ask, what are the instructions that a range 
compiles down to when everything is properly inlined?

The initialisation, since it's a function, pulls from the stack.

mov rax, qword ptr [rsp + 16]
movsxd rcx, dword ptr [rsp + 32]

But the comparison looks virtually identical.

cmp rax, rcx
jb .LBB2_4

But how does it do the add? With some register magic.

movsxd rcx, edx
lea edx, [rcx + r9]

Now, what that looks like it's doing to me is combining the pointer 
load and index increment into those two instructions. One 
instruction fewer than the flat for loop.
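
The range code itself isn't shown here. A minimal sketch of a range-based equivalent, assuming it was written with std.range.stride from Phobos over a ubyte[] slice (the function name zeroStridedRange and the stride of 3 are illustrative, not taken from the disassembled source):

```d
import std.range : stride;

// Hypothetical range-based counterpart to the flat for loop.
// Assumption: the range in question is std.range.stride; the post
// does not say which range produced the listing above.
void zeroStridedRange(ubyte[] a)
{
    // stride(3) yields every third element; `ref` lets the loop
    // write back into the underlying slice.
    foreach (ref e; a.stride(3))
        e = 0;
}
```

There is no explicit index or bounds comparison in the source; the range carries that state, which is what gets inlined down to the cmp/jb pair and the movsxd/lea pair.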

In conclusion: the semantics you talk about are literally some of 
the most basic instructions in computing, and escaping the 
confines of a for loop for a foreach loop can let the compiler 
generate more efficient code than 50-year-old compsci concepts 
can.