D compilation is too slow and I am forking the compiler
welkam
wwwelkam at gmail.com
Fri Nov 23 16:21:53 UTC 2018
On Friday, 23 November 2018 at 14:32:39 UTC, Vladimir Panteleev
wrote:
> On Friday, 23 November 2018 at 13:23:22 UTC, welkam wrote:
>> If we run these steps in different threads on the same core
>> with SMT we could better use the core's resources: reading the
>> file with the kernel, decoding UTF-8 with vector instructions,
>> and lexing/parsing with scalar operations, while all
>> communication is done through the L1 and L2 cache.
>
> You might save some pages from the data cache, but by doing
> more work at once, the code might stop fitting in the
> execution-related caches (code pages, microcode, branch
> prediction) instead.
It's not about saving TLB pages or fitting better in cache.
Compilers are considered streaming applications - they don't
utilize CPU caches effectively. You can't read one character,
emit machine code, and then read the next character; you have to
go over all the data multiple times while you modify it. I can
find white papers, if you're interested, where people test GCC
with different cache architectures and it doesn't make much of a
difference. GCC is a popular application for testing caches.
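In toy form - this is made-up illustration code, not anything
taken from DMD - the access pattern looks roughly like this: each
pass walks the whole input again, and by the time a later pass
runs, the bytes it needs have long been evicted from cache.

import std.ascii : isWhite;

struct Token { size_t start, length; }

// Pass 1: touch every byte of the source once and build a token array.
Token[] lex(const(char)[] src)
{
    Token[] tokens;
    for (size_t i = 0; i < src.length; )
    {
        if (isWhite(src[i])) { ++i; continue; }
        immutable start = i;
        while (i < src.length && !isWhite(src[i])) ++i;
        tokens ~= Token(start, i - start);
    }
    return tokens;
}

// Pass 2: walk the whole input again through the tokens, long after
// the bytes read in pass 1 have left the cache.
size_t countUnderscoreTokens(const(char)[] src, const Token[] tokens)
{
    size_t n;
    foreach (t; tokens)
        if (src[t.start] == '_')
            ++n;
    return n;
}

A real compiler does this over many more passes (lexing, parsing,
semantic analysis, codegen), each one re-walking data the previous
pass produced.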
Here is the profiling data from DMD:
 Performance counter stats for 'dmd -c main.d':

            600.77 msec task-clock:u              #     0.803 CPUs utilized
                 0      context-switches:u        #     0.000 K/sec
                 0      cpu-migrations:u          #     0.000 K/sec
            33,209      page-faults:u             # 55348.333 M/sec
     1,072,289,307      cycles:u                  # 1787148.845 GHz
       870,175,210      stalled-cycles-frontend:u #    81.15% frontend cycles idle
       721,897,927      stalled-cycles-backend:u  #    67.32% backend cycles idle
       881,895,208      instructions:u            #     0.82  insn per cycle
                                                  #     0.99  stalled cycles per insn
       171,211,752      branches:u                # 285352920.000 M/sec
        11,287,327      branch-misses:u           #     6.59% of all branches

       0.747720395 seconds time elapsed

       0.497698000 seconds user
       0.104165000 seconds sys
The most important number for this conversation is 0.82 insn per
cycle. My CPU can do ~2 IPC, so there are plenty of CPU resources
left unused. New Intel desktop processors are designed to do 4
insn/cycle. What is limiting DMD's performance is slow RAM and
data fetching, not the things you listed.
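For reference, that figure falls straight out of the two counters
above:

instructions / cycles = 881,895,208 / 1,072,289,307 ≈ 0.82 insn per cycle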
Code pages - you mean the TLB here?

Microcode cache - not all processors have one, and those that do
mostly benefit trivial loops; DMD has complex loops.

Branch prediction - more entries in the branch predictor won't
help here, because branches are missed because the data is
unpredictable, not because there are too many branches. Also, the
branch misprediction penalty is around 30 cycles, while reading
from RAM can be over 200 cycles.

L1 code cache - you didn't mention this, but running those tasks
in SMT mode might thrash the L1 instruction cache, so execution
might not be optimal.
Rather than parallel reading of imports, what DMD needs is more
data-oriented data structures in place of its old OOP-inspired
ones. I'll give you an example of why that is the case.
Consider
struct {
    bool isAlive;
    <other data, at least 7 bytes in size>
}
If you want to read that bool, the CPU has to fetch the whole
64-byte cache line it sits in. What this means is that for one
byte of useful information the CPU fetches 64 bytes of data,
resulting in 1/64 = 0.015625, or ~1.6% signal to noise ratio.
This is terrible!
AFAIK DMD doesn't make this particular mistake, but it is full of
large structs and classes that are not efficient to read. To fix
this we need to split those large data structures into smaller
ones that only contain what a particular algorithm needs. I
predict a 2x speed improvement if we transform all the data
structures in DMD - and that is without improving any algorithms,
only changing data structures.
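To make that concrete, here is a minimal sketch of such a split -
hypothetical types I made up for illustration, not DMD's actual
ones:

// "Before": one fat struct, so scanning isAlive drags a whole
// 64-byte cache line into the CPU per object.
struct Entity
{
    bool isAlive;
    ubyte[63] otherData;   // cold data, pads the struct to 64 bytes
}

size_t countAlive(const Entity[] entities)
{
    size_t n;
    foreach (ref e; entities)
        if (e.isAlive)     // 1 useful byte out of 64 fetched
            ++n;
    return n;
}

// "After": the hot field gets its own dense array, so 64 flags
// now fit in a single cache line.
struct Entities
{
    bool[] isAlive;         // hot: the only thing this algorithm scans
    ubyte[63][] otherData;  // cold: touched only when actually needed
}

size_t countAlive(const ref Entities entities)
{
    size_t n;
    foreach (flag; entities.isAlive)
        if (flag)
            ++n;
    return n;
}

Scanning isAlive now reads 1 byte per object instead of 64 - the
1/64 ratio from above, turned around in our favor.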
This is getting too long, so I will stop right now.