Improving codegen for ARM Cortex-M

Fri Jul 20 11:11:12 UTC 2018

I've finally succeeded in getting a build of my STM32 ARM 
Cortex-M proof of concept in LDC and GDC, thanks to the recent 
changes in both compilers.  So, I now have a way to compare code 
generation between the two compilers.

The project is extremely simple; it just draws a bunch of 
random rectangles on its small LCD screen.  This is done by 
simply writing to memory in a frame buffer.

Unfortunately, GDC's code executes quite a bit slower than LDC's 
code.  The difference is quite noticeable, as I can see the rate 
of the status LED blinking much slower with GDC than with LDC.

The code to do this is below.  I simplified it for this 
discussion, but tested it to ensure it still reproduces the 
symptoms.  I also removed the random behavior to eliminate that 
variable.

a block of code in main.d
---
uint i = 0;
while(true)
{
     lcd.fillRect(x, y, width, height, color);
     if ((i % 1000) == 0)
     {
         statusLED.toggle();
     }

     i++;
}

in lcd.d
---
@noinline pragma(inline, false) void fillRect(int x, int y, uint width, uint height, ushort color)
{
     int y2 = y + height;
     for(int _y = y; _y <= y2; _y++)
     {
         ltdc.fillSpan(x, _y, width, color);
     }
}

from ltdc.d
-----------
void fillSpan(int x, int y, uint spanWidth, ushort color)
{
     int start = y * width + x;
     for(int i = 0; i < spanWidth; i++)
     {
         frameBuffer[start + i] = color;
     }
}

LDC disassembly
---------------
ldc2 -conf= -disable-simplify-libcalls -c -Os  
-mtriple=thumb-none-eabi -float-abi=hard -mcpu=cortex-m4 
-Isource/runtime -boundscheck=off

<_D5board3lcd8fillRectFiikktZv>:
80000b8:  e92d 43f0   stmdb  sp!, {r4, r5, r6, r7, r8, r9, lr}
80000bc:  eb03 0e01   add.w  lr, r3, r1
80000c0:  459e        cmp  lr, r3
80000c2:  bfb8        it  lt
80000c4:  e8bd 83f0   ldmialt.w  sp!, {r4, r5, r6, r7, r8, r9, pc}
80000c8:  f8dd c01c   ldr.w  ip, [sp, #28]
80000cc:  ebc3 1503   rsb  r5, r3, r3, lsl #4
80000d0:  f240 0800   movw  r8, #0
80000d4:  f002 0103   and.w  r1, r2, #3
80000d8:  f1a2 0901   sub.w  r9, r2, #1
80000dc:  f2c2 0800   movt  r8, #8192  ; 0x2000
80000e0:  1a54        subs  r4, r2, r1
80000e2:  eb0c 1505   add.w  r5, ip, r5, lsl #4
80000e6:  eb08 0545   add.w  r5, r8, r5, lsl #1
80000ea:  1d2f        adds  r7, r5, #4
80000ec:  b1f2        cbz  r2, 800012c <_D5board3lcd8fillRectFiikktZv+0x74>
80000ee:  2500        movs  r5, #0
80000f0:  f1b9 0f03   cmp.w  r9, #3
80000f4:  d30a        bcc.n  800010c <_D5board3lcd8fillRectFiikktZv+0x54>
80000f6:  463e        mov  r6, r7
80000f8:  3504        adds  r5, #4
80000fa:  f826 0c02   strh.w  r0, [r6, #-2]
80000fe:  f826 0c04   strh.w  r0, [r6, #-4]
8000102:  8030        strh  r0, [r6, #0]
8000104:  8070        strh  r0, [r6, #2]
8000106:  3608        adds  r6, #8
8000108:  42ac        cmp  r4, r5
800010a:  d1f5        bne.n  80000f8 <_D5board3lcd8fillRectFiikktZv+0x40>
800010c:  b171        cbz  r1, 800012c <_D5board3lcd8fillRectFiikktZv+0x74>
800010e:  ebc3 1603   rsb  r6, r3, r3, lsl #4
8000112:  2901        cmp  r1, #1
8000114:  eb0c 1606   add.w  r6, ip, r6, lsl #4
8000118:  4435        add  r5, r6
800011a:  f828 0015   strh.w  r0, [r8, r5, lsl #1]
800011e:  d005        beq.n  800012c <_D5board3lcd8fillRectFiikktZv+0x74>
8000120:  eb08 0545   add.w  r5, r8, r5, lsl #1
8000124:  2902        cmp  r1, #2
8000126:  8068        strh  r0, [r5, #2]
8000128:  bf18        it  ne
800012a:  80a8        strhne  r0, [r5, #4]
800012c:  3301        adds  r3, #1
800012e:  f507 77f0   add.w  r7, r7, #480  ; 0x1e0
8000132:  4573        cmp  r3, lr
8000134:  ddda        ble.n  80000ec <_D5board3lcd8fillRectFiikktZv+0x34>
8000136:  e8bd 83f0   ldmia.w  sp!, {r4, r5, r6, r7, r8, r9, pc}

GDC disassembly
---------------
arm-none-eabi-gdc -c -O2 -nophoboslib -nostdinc -nodefaultlibs 
-nostdlib -mthumb -mcpu=cortex-m4 -mtune=cortex-m4 
-mfloat-abi=hard -Isource/runtime -fno-bounds-check 
-ffunction-sections -fdata-sections -fno-weak

<_D5board3lcd8fillRectFiikktZv>:
800049c:  b470        push  {r4, r5, r6}
800049e:  440b        add  r3, r1
80004a0:  4299        cmp  r1, r3
80004a2:  f8bd 500c   ldrh.w  r5, [sp, #12]
80004a6:  dc15        bgt.n  80004d4 <_D5board3lcd8fillRectFiikktZv+0x38>
80004a8:  ebc1 1401   rsb  r4, r1, r1, lsl #4
80004ac:  eb00 1004   add.w  r0, r0, r4, lsl #4
80004b0:  4c09        ldr  r4, [pc, #36]  ; (80004d8 <_D5board3lcd8fillRectFiikktZv+0x3c>)
80004b2:  4410        add  r0, r2
80004b4:  ebc2 76c2   rsb  r6, r2, r2, lsl #31
80004b8:  eb04 0440   add.w  r4, r4, r0, lsl #1
80004bc:  0076        lsls  r6, r6, #1
80004be:  b122        cbz  r2, 80004ca <_D5board3lcd8fillRectFiikktZv+0x2e>
80004c0:  19a0        adds  r0, r4, r6
80004c2:  f820 5b02   strh.w  r5, [r0], #2
80004c6:  42a0        cmp  r0, r4
80004c8:  d1fb        bne.n  80004c2 <_D5board3lcd8fillRectFiikktZv+0x26>
80004ca:  3101        adds  r1, #1
80004cc:  428b        cmp  r3, r1
80004ce:  f504 74f0   add.w  r4, r4, #480  ; 0x1e0
80004d2:  daf4        bge.n  80004be <_D5board3lcd8fillRectFiikktZv+0x22>
80004d4:  bc70        pop  {r4, r5, r6}
80004d6:  4770        bx  lr
80004d8:  20000000   .word  0x20000000

For both LDC and GDC, `fillSpan` gets inlined into `fillRect`.  I 
had to disable inlining for `fillRect` to make it easier to 
compare the disassembly; otherwise all I get is a huge `main`.

I used `-O2` for GDC because `-Os` was even slower and didn't 
inline `fillSpan`.

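Although GDC's code is shorter, LDC's code is faster.  My first 
guess was the `ldm` and `stm` instructions (load multiple and 
store multiple) in the LDC disassembly, but those are not SIMD 
instructions; here they only save and restore registers on entry 
and exit.  A more likely cause is that LDC unrolled the inner 
loop to four `strh` stores per iteration, while GDC's loop does a 
single `strh` per iteration, but I'm not sure.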
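
One visible difference between the two listings is the inner 
loop: LDC's loop at 80000f8 issues four `strh` stores per 
iteration, while GDC's loop at 80004c2 issues one.  The same 
shape can be written by hand.  What follows is only a sketch 
under assumptions, not the project's code: `fbWidth`, `fbHeight`, 
`fillSpanSimple` and `fillSpanUnrolled` are made-up names, the 
240-pixel width is inferred from the 480-byte row stride in the 
disassembly, the 320-pixel height is a guess, and it is untested 
on the target, so it may or may not close the gap with GDC.

```d
// Hypothetical stand-ins; the real declarations in ltdc.d are not shown.
enum int fbWidth  = 240;   // inferred: 480-byte row stride / 2 bytes per pixel
enum int fbHeight = 320;   // assumption, not taken from the post
__gshared ushort[fbWidth * fbHeight] frameBuffer;

// Reference version with the same shape as the original fillSpan:
// one store per loop iteration.
void fillSpanSimple(int x, int y, uint spanWidth, ushort color)
{
    int start = y * fbWidth + x;
    for (uint i = 0; i < spanWidth; i++)
    {
        frameBuffer[start + i] = color;
    }
}

// Hand-unrolled by four, similar in shape to what LDC generated:
// four stores per iteration, then up to three leftover pixels.
void fillSpanUnrolled(int x, int y, uint spanWidth, ushort color)
{
    ushort* p = frameBuffer.ptr + (y * fbWidth + x);
    uint blocks = spanWidth / 4;   // full groups of four pixels
    uint rest   = spanWidth % 4;   // 0 to 3 remaining pixels

    for (; blocks != 0; blocks--)
    {
        p[0] = color;
        p[1] = color;
        p[2] = color;
        p[3] = color;
        p += 4;
    }
    for (; rest != 0; rest--)
    {
        *p++ = color;
    }
}
```

For a span of 7 pixels, `fillSpanUnrolled` runs one block of 
four stores followed by three single stores, and writes the same 
pixels as `fillSpanSimple`.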

I've tried a number of different optimization permutations (too 
many to list here), but they didn't seem to make any difference.

I ask for any insight you might have, should you wish to give 
this your attention.  Regardless, I'll keep investigating.

Thanks,
Mike