Improving codegen for ARM Cortex-M
Mike Franklin
slavo5150 at yahoo.com
Fri Jul 20 11:11:12 UTC 2018
I've finally succeeded in getting a build of my STM32 ARM
Cortex-M proof of concept in LDC and GDC, thanks to the recent
changes in both compilers. So, I now have a way to compare code
generation between the two compilers.
The project is extremely simple; it just generates a bunch of
random rectangles on its small LCD screen. This is done by
simply writing to memory in a frame buffer.
Unfortunately, GDC's code executes quite a bit slower than LDC's
code. The difference is quite noticeable, as I can see the rate
of the status LED blinking much slower with GDC than with LDC.
The code to do this is below. I simplified it for this
discussion, but tested it to make sure it still reproduces the
symptoms. I also removed the random behavior to eliminate that
variable.
a block of code in main.d
---
uint i = 0;
while(true)
{
    lcd.fillRect(x, y, width, height, color);
    if ((i % 1000) == 0)
    {
        statusLED.toggle();
    }
    i++;
}
in lcd.d
---
@noinline pragma(inline, false) void fillRect(int x, int y, uint width, uint height, ushort color)
{
    int y2 = y + height;
    for(int _y = y; _y <= y2; _y++)
    {
        ltdc.fillSpan(x, _y, width, color);
    }
}
from ltdc.d
-----------
void fillSpan(int x, int y, uint spanWidth, ushort color)
{
    int start = y * width + x;
    for(int i = 0; i < spanWidth; i++)
    {
        frameBuffer[start + i] = color;
    }
}
LDC disassembly
---------------
ldc2 -conf= -disable-simplify-libcalls -c -Os
-mtriple=thumb-none-eabi -float-abi=hard -mcpu=cortex-m4
-Isource/runtime -boundscheck=off
<_D5board3lcd8fillRectFiikktZv>:
80000b8: e92d 43f0 stmdb sp!, {r4, r5, r6, r7, r8, r9, lr}
80000bc: eb03 0e01 add.w lr, r3, r1
80000c0: 459e cmp lr, r3
80000c2: bfb8 it lt
80000c4: e8bd 83f0 ldmialt.w sp!, {r4, r5, r6, r7, r8, r9, pc}
80000c8: f8dd c01c ldr.w ip, [sp, #28]
80000cc: ebc3 1503 rsb r5, r3, r3, lsl #4
80000d0: f240 0800 movw r8, #0
80000d4: f002 0103 and.w r1, r2, #3
80000d8: f1a2 0901 sub.w r9, r2, #1
80000dc: f2c2 0800 movt r8, #8192 ; 0x2000
80000e0: 1a54 subs r4, r2, r1
80000e2: eb0c 1505 add.w r5, ip, r5, lsl #4
80000e6: eb08 0545 add.w r5, r8, r5, lsl #1
80000ea: 1d2f adds r7, r5, #4
80000ec: b1f2 cbz r2, 800012c
<_D5board3lcd8fillRectFiikktZv+0x74>
80000ee: 2500 movs r5, #0
80000f0: f1b9 0f03 cmp.w r9, #3
80000f4: d30a bcc.n 800010c
<_D5board3lcd8fillRectFiikktZv+0x54>
80000f6: 463e mov r6, r7
80000f8: 3504 adds r5, #4
80000fa: f826 0c02 strh.w r0, [r6, #-2]
80000fe: f826 0c04 strh.w r0, [r6, #-4]
8000102: 8030 strh r0, [r6, #0]
8000104: 8070 strh r0, [r6, #2]
8000106: 3608 adds r6, #8
8000108: 42ac cmp r4, r5
800010a: d1f5 bne.n 80000f8
<_D5board3lcd8fillRectFiikktZv+0x40>
800010c: b171 cbz r1, 800012c
<_D5board3lcd8fillRectFiikktZv+0x74>
800010e: ebc3 1603 rsb r6, r3, r3, lsl #4
8000112: 2901 cmp r1, #1
8000114: eb0c 1606 add.w r6, ip, r6, lsl #4
8000118: 4435 add r5, r6
800011a: f828 0015 strh.w r0, [r8, r5, lsl #1]
800011e: d005 beq.n 800012c
<_D5board3lcd8fillRectFiikktZv+0x74>
8000120: eb08 0545 add.w r5, r8, r5, lsl #1
8000124: 2902 cmp r1, #2
8000126: 8068 strh r0, [r5, #2]
8000128: bf18 it ne
800012a: 80a8 strhne r0, [r5, #4]
800012c: 3301 adds r3, #1
800012e: f507 77f0 add.w r7, r7, #480 ; 0x1e0
8000132: 4573 cmp r3, lr
8000134: ddda ble.n 80000ec
<_D5board3lcd8fillRectFiikktZv+0x34>
8000136: e8bd 83f0 ldmia.w sp!, {r4, r5, r6, r7, r8, r9, pc}
GDC disassembly
---------------
arm-none-eabi-gdc -c -O2 -nophoboslib -nostdinc -nodefaultlibs
-nostdlib -mthumb -mcpu=cortex-m4 -mtune=cortex-m4
-mfloat-abi=hard -Isource/runtime -fno-bounds-check
-ffunction-sections -fdata-sections -fno-weak
<_D5board3lcd8fillRectFiikktZv>:
800049c: b470 push {r4, r5, r6}
800049e: 440b add r3, r1
80004a0: 4299 cmp r1, r3
80004a2: f8bd 500c ldrh.w r5, [sp, #12]
80004a6: dc15 bgt.n 80004d4
<_D5board3lcd8fillRectFiikktZv+0x38>
80004a8: ebc1 1401 rsb r4, r1, r1, lsl #4
80004ac: eb00 1004 add.w r0, r0, r4, lsl #4
80004b0: 4c09 ldr r4, [pc, #36] ; (80004d8
<_D5board3lcd8fillRectFiikktZv+0x3c>)
80004b2: 4410 add r0, r2
80004b4: ebc2 76c2 rsb r6, r2, r2, lsl #31
80004b8: eb04 0440 add.w r4, r4, r0, lsl #1
80004bc: 0076 lsls r6, r6, #1
80004be: b122 cbz r2, 80004ca
<_D5board3lcd8fillRectFiikktZv+0x2e>
80004c0: 19a0 adds r0, r4, r6
80004c2: f820 5b02 strh.w r5, [r0], #2
80004c6: 42a0 cmp r0, r4
80004c8: d1fb bne.n 80004c2
<_D5board3lcd8fillRectFiikktZv+0x26>
80004ca: 3101 adds r1, #1
80004cc: 428b cmp r3, r1
80004ce: f504 74f0 add.w r4, r4, #480 ; 0x1e0
80004d2: daf4 bge.n 80004be
<_D5board3lcd8fillRectFiikktZv+0x22>
80004d4: bc70 pop {r4, r5, r6}
80004d6: 4770 bx lr
80004d8: 20000000 .word 0x20000000
For both LDC and GDC `fillSpan` gets inlined into `fillRect`. I
had to disable inlining for `fillRect` to make it easier to
compare the disassembly, otherwise all I get is a huge `main`.
I used `-O2` for GDC because `-Os` was even slower and didn't
inline `fillSpan`.
Although GDC's code is shorter, LDC's code is faster. My guess
is that this is due to the `ldm` and `stm` instructions in the
LDC disassembly (load multiple and store multiple), but I'm not
sure.
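One difference that is visible in the two listings: LDC's inner loop issues four `strh` stores per iteration and then handles up to three leftover pixels separately, while GDC's inner loop issues one `strh` per iteration. The listings alone don't prove that this accounts for the speed difference, but it is easy to test by unrolling by hand. Below is a minimal host-side sketch in D of what a manually unrolled `fillSpan` might look like. The name `fillSpanUnrolled`, the constants `rowWidth` and `rows`, and the array-backed `frameBuffer` are assumptions made for illustration; the 240-pixel row width is inferred from the 480-byte row stride in the disassembly.

```d
// Host-side sketch: on the board the frame buffer is LCD memory, here it is
// a plain array so the code can run anywhere.
enum int rowWidth = 240; // assumed: 480-byte row stride / 2 bytes per pixel
enum int rows = 8;       // hypothetical, just enough rows to experiment with
ushort[rowWidth * rows] frameBuffer;

// Hypothetical hand-unrolled variant of fillSpan: four stores per iteration,
// followed by a short loop for the remaining zero to three pixels.
void fillSpanUnrolled(int x, int y, uint spanWidth, ushort color)
{
    size_t start = cast(size_t)(y * rowWidth + x);
    uint blocks = spanWidth & ~3u; // largest multiple of 4 <= spanWidth
    uint i = 0;

    // Main body: four halfword stores per iteration.
    for (; i < blocks; i += 4)
    {
        frameBuffer[start + i]     = color;
        frameBuffer[start + i + 1] = color;
        frameBuffer[start + i + 2] = color;
        frameBuffer[start + i + 3] = color;
    }

    // Remainder: zero to three leftover pixels.
    for (; i < spanWidth; i++)
    {
        frameBuffer[start + i] = color;
    }
}
```

If a hand-unrolled version like this closes the gap under GDC, that would point at loop unrolling rather than `ldm`/`stm`. GCC's `-funroll-loops` option might be another thing to try.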
I've tried a number of different optimization permutations (too
many to list here), but they didn't seem to make any difference.
I ask for any insight you might have, should you wish to give
this your attention. Regardless, I'll keep investigating.
Thanks,
Mike