Work on ARM backend for DMD started
H. S. Teoh via Digitalmars-d-announce
digitalmars-d-announce at puremagic.com
Wed Jul 5 16:48:41 PDT 2017
On Tue, Jul 04, 2017 at 05:11:55PM -0700, Walter Bright via Digitalmars-d-announce wrote:
> On 7/4/2017 4:14 PM, H. S. Teoh via Digitalmars-d-announce wrote:
> > Also, loop unrolling is only the beginning. Other loop
> > optimizations are just as important, like strength reduction,
> > hoisting, etc.. (Caveat: I haven't checked whether DMD specifically
> > performs these optimizations.
>
> It does.
>
> > But based on looking at previous dmd output, I'm leaning towards
> > no.)
>
> I wish people would look at it before assuming. It's not like it's a
> secret.
>
> https://github.com/dlang/dmd/blob/master/src/ddmd/backend/gloop.c
>
> Read the comments.
I did a simple test to see which loop optimizations dmd did, vs. gdc.
Here's the test code:
------
int func(int[] data)
{
    int i, j;
    for (i = 0; i < 10; i++) {
        data[i*10] = i;
        j = data[0] * 10;
    }
    return j;
}

void main() {
    import std.stdio;
    auto data = new int[100];
    writeln(func(data));
}
------
Here's the output of dmd -O (git HEAD):
------
0000000000046b00 <_D4test4funcFAiZi>:
46b00: 55 push %rbp
46b01: 48 8b ec mov %rsp,%rbp
46b04: 48 89 fa mov %rdi,%rdx
46b07: 49 89 f1 mov %rsi,%r9
46b0a: 45 31 c0 xor %r8d,%r8d
46b0d: 31 c9 xor %ecx,%ecx
46b0f: 48 63 c1 movslq %ecx,%rax
46b12: 48 3b c2 cmp %rdx,%rax
46b15: 72 11 jb 46b28 <_D4test4funcFAiZi+0x28>
46b17: be 05 00 00 00 mov $0x5,%esi
46b1c: 48 8d 3d cd 53 03 00 lea 0x353cd(%rip),%rdi # 7bef0 <_TMP0>
46b23: e8 64 0a 00 00 callq 4758c <_d_arrayboundsp>
46b28: 45 89 04 81 mov %r8d,(%r9,%rax,4)
46b2c: 48 85 d2 test %rdx,%rdx
46b2f: 75 11 jne 46b42 <_D4test4funcFAiZi+0x42>
46b31: be 06 00 00 00 mov $0x6,%esi
46b36: 48 8d 3d b3 53 03 00 lea 0x353b3(%rip),%rdi # 7bef0 <_TMP0>
46b3d: e8 4a 0a 00 00 callq 4758c <_d_arrayboundsp>
46b42: 41 8b 01 mov (%r9),%eax
46b45: 44 8d 1c 80 lea (%rax,%rax,4),%r11d
46b49: 45 03 db add %r11d,%r11d
46b4c: 83 c1 0a add $0xa,%ecx
46b4f: 41 ff c0 inc %r8d
46b52: 41 83 f8 0a cmp $0xa,%r8d
46b56: 72 b7 jb 46b0f <_D4test4funcFAiZi+0xf>
46b58: 41 8b c3 mov %r11d,%eax
46b5b: 5d pop %rbp
46b5c: c3 retq
46b5d: 00 00 add %al,(%rax)
...
------
Note: for conciseness' sake I omitted the disassembly of main(), since
it's not directly relevant here.
Here are some pertinent observations:
- Strength reduction was done, as seen at 46b4c: instead of computing
  i*10 with a multiply each iteration, the array index is kept in a
  register and incremented by 10.
- Code hoisting was NOT done (in this case): the second line in the loop
  body does not depend on the loop index, but dmd did not hoist it out
  of the loop. This can be seen from the end-of-loop branch at 46b56:
  the branch destination is 46b0f, near the beginning of the function,
  and the code path from there includes the code for the assignment to
  j. While some clever tricks were used to avoid a mul instruction for
  computing data[0]*10 (the lea/add pair at 46b45-46b49), this
  computation is unfortunately repeated 10 times even though it only
  needs to be done once. In particular, the load of data[0] at 46b42 is
  repeated 10 times, followed each time by the *10 computation.
- There are two calls to _d_arrayboundsp inside the loop body, along
  with branches around them. This seems needless, since a single bounds
  check would suffice to ensure both array accesses are within bounds.
  It also means there are 2 branches within the loop body (not counting
  the end-of-loop branch), where one would have sufficed (fewer branch
  hazards in the CPU pipeline).
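To make this concrete, here's a rough C rendering of what dmd's loop
effectively does. This is my own reconstruction for illustration, not
compiler output; bounds_error() is a stand-in for the _d_arrayboundsp
runtime call:

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for dmd's _d_arrayboundsp runtime call. */
static void bounds_error(void) { abort(); }

/* Approximates dmd -O's codegen: the i*10 index is strength-reduced
   to an additive counter, but both bounds checks and the data[0]*10
   computation all stay inside the loop. */
int func_dmd_style(int *data, size_t len)
{
    int j = 0;
    size_t idx = 0;                 /* strength-reduced i*10 */
    for (int i = 0; i < 10; i++) {
        if (idx >= len)             /* bounds check #1, every iteration */
            bounds_error();
        data[idx] = i;
        if (len == 0)               /* bounds check #2, every iteration */
            bounds_error();
        j = data[0] * 10;           /* reloaded and rescaled 10 times */
        idx += 10;
    }
    return j;
}
```

With the 100-element array from the test program this returns 0 (the
first iteration stores 0 to data[0] and nothing overwrites it), but it
pays for 10 loads of data[0] and 20 bounds-check branches along the way.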
In comparison, here's the output of gdc -O (gdc 6.3.0):
------
0000000000020080 <_D4test4funcFAiZi>:
20080: 48 85 ff test %rdi,%rdi
20083: 74 33 je 200b8 <_D4test4funcFAiZi+0x38>
20085: 48 89 f9 mov %rdi,%rcx
20088: 49 89 f0 mov %rsi,%r8
2008b: c7 06 00 00 00 00 movl $0x0,(%rsi)
20091: ba 0a 00 00 00 mov $0xa,%edx
20096: b8 01 00 00 00 mov $0x1,%eax
2009b: 48 39 d1 cmp %rdx,%rcx
2009e: 76 18 jbe 200b8 <_D4test4funcFAiZi+0x38>
200a0: 41 89 04 90 mov %eax,(%r8,%rdx,4)
200a4: 83 c0 01 add $0x1,%eax
200a7: 48 83 c2 0a add $0xa,%rdx
200ab: 83 f8 0a cmp $0xa,%eax
200ae: 75 eb jne 2009b <_D4test4funcFAiZi+0x1b>
200b0: 8b 06 mov (%rsi),%eax
200b2: 8d 04 80 lea (%rax,%rax,4),%eax
200b5: 01 c0 add %eax,%eax
200b7: c3 retq
200b8: 48 83 ec 08 sub $0x8,%rsp
200bc: ba 05 00 00 00 mov $0x5,%edx
200c1: bf 06 00 00 00 mov $0x6,%edi
200c6: 48 8d 35 81 7a 07 00 lea 0x77a81(%rip),%rsi # 97b4e <_IO_stdin_used+0x4e>
200cd: e8 8e a4 03 00 callq 5a560 <_d_arraybounds>
------
Comparing this with dmd's output, we see:
- Strength reduction was done on the i*10 computation (line 200a7), just
as in the dmd output.
- Code hoisting was also done (unlike in dmd): the computation
  data[0]*10 was hoisted out of the loop (line 200b2) and computed only
  once, after the loop ends, instead of 10 times. Notably, data[0] is
  no longer loaded 10 times, but just once at the end of the loop.
- One of the bounds checks is moved out of the loop body, so there is
  only 1 branch inside the loop (fewer branch hazards in the CPU
  pipeline).
- The function is noticeably smaller than dmd's output, due to gdc
merging the calls to _d_arraybounds into a single path.
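Rendered the same way in C (again my reconstruction, not gdc's actual
source-level transform, with bounds_error() standing in for the
_d_arraybounds call), gdc's output is roughly equivalent to:

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for gdc's _d_arraybounds runtime call. */
static void bounds_error(void) { abort(); }

/* Approximates gdc -O's codegen: the first iteration's store to
   data[0] is peeled off, one bounds check remains per iteration,
   and data[0]*10 is computed once, after the loop. */
int func_gdc_style(int *data, size_t len)
{
    if (len == 0)                   /* single up-front check for data[0] */
        bounds_error();
    data[0] = 0;                    /* peeled first iteration */
    size_t idx = 10;
    for (int i = 1; i < 10; i++) {  /* remaining 9 iterations */
        if (idx >= len)             /* the one in-loop bounds check */
            bounds_error();
        data[idx] = i;
        idx += 10;
    }
    return data[0] * 10;            /* hoisted: one load, one scale */
}
```

Peeling the first iteration is what makes the hoist legal: once the
store to data[0] is pulled out of the loop, data[0]*10 really is
loop-invariant and can be computed once at the end.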
Now, granted, my test case could be construed as unfair, because the
assignment to j depends on the result of the first loop iteration
(data[0] is assigned before it's read by the assignment to j), so the
expression isn't truly loop-invariant in the strict sense. However, as
the gdc output shows, the compiler ought to be able to refactor things
so that the computation is moved out of the loop.
So while I was wrong about dmd not doing strength reduction, my
conclusion is still that dmd's codegen for loops leaves much to be
desired. In particular, it doesn't seem to do code hoisting, at least
not in this case, whereas gdc does (and consistently so in other loop
code I've looked at in the past).
T
--
An imaginary friend squared is a real enemy.