D compilation is too slow and I am forking the compiler
welkam
wwwelkam at gmail.com
Fri Nov 23 16:21:53 UTC 2018
On Friday, 23 November 2018 at 14:32:39 UTC, Vladimir Panteleev
wrote:
> On Friday, 23 November 2018 at 13:23:22 UTC, welkam wrote:
>> If we run these steps in different threads on the same core
>> with SMT we could better use the core's resources: reading the
>> file with the kernel, decoding UTF-8 with vector instructions,
>> and lexing/parsing with scalar operations, while all
>> communication is done through the L1 and L2 cache.
>
> You might save some pages from the data cache, but by doing
> more work at once, the code might stop fitting in the
> execution-related caches (code pages, microcode, branch
> prediction) instead.
It's not about saving TLB pages or fitting better in cache.
Compilers are considered streaming applications - they don't
utilize CPU caches effectively. You can't read one character,
emit machine code, and then read the next character; you have to
go over all the data multiple times while you modify it. I can
find white papers, if you're interested, where people test GCC
with different cache architectures and it doesn't make much of a
difference. GCC is a popular application for testing caches.
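In toy form - this is made-up illustration code, not anything
taken from DMD - the access pattern looks roughly like this: each
pass walks the whole input again, and by the time a later pass
runs, the bytes it needs have long been evicted from cache.

import std.ascii : isWhite;

struct Token { size_t start, length; }

// Pass 1: touch every byte of the source once and build a token array.
Token[] lex(const(char)[] src)
{
    Token[] tokens;
    for (size_t i = 0; i < src.length; )
    {
        if (isWhite(src[i])) { ++i; continue; }
        immutable start = i;
        while (i < src.length && !isWhite(src[i])) ++i;
        tokens ~= Token(start, i - start);
    }
    return tokens;
}

// Pass 2: walk the whole input again through the tokens, long after
// the bytes read in pass 1 have left the cache.
size_t countUnderscoreTokens(const(char)[] src, const Token[] tokens)
{
    size_t n;
    foreach (t; tokens)
        if (src[t.start] == '_')
            ++n;
    return n;
}

A real compiler does this over many more passes (lexing, parsing,
semantic analysis, codegen), each one re-walking data the previous
pass produced.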
Here is the profiling data from DMD:
 Performance counter stats for 'dmd -c main.d':

            600.77 msec task-clock:u              #     0.803 CPUs utilized
                 0      context-switches:u        #     0.000 K/sec
                 0      cpu-migrations:u          #     0.000 K/sec
            33,209      page-faults:u             # 55348.333 M/sec
     1,072,289,307      cycles:u                  # 1787148.845 GHz
       870,175,210      stalled-cycles-frontend:u #    81.15% frontend cycles idle
       721,897,927      stalled-cycles-backend:u  #    67.32% backend cycles idle
       881,895,208      instructions:u            #     0.82  insn per cycle
                                                  #     0.99  stalled cycles per insn
       171,211,752      branches:u                # 285352920.000 M/sec
        11,287,327      branch-misses:u           #     6.59% of all branches

       0.747720395 seconds time elapsed

       0.497698000 seconds user
       0.104165000 seconds sys
The most important number for this conversation is 0.82 insn per
cycle. My CPU can do ~2 IPC, so there are plenty of CPU resources
left unused. New Intel desktop processors are designed to do 4
insn/cycle. What is limiting DMD's performance is slow RAM and
data fetching, not the things you listed.
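For reference, that figure falls straight out of the two counters
above:

instructions / cycles = 881,895,208 / 1,072,289,307 ≈ 0.82 insn per cycle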
Code pages - you mean the TLB here?

Microcode cache - not all processors have one, and those that do
mostly benefit trivial loops; DMD has complex loops.

Branch prediction - more entries in the branch predictor won't
help here, because branches are missed because the data is
unpredictable, not because there are too many branches. Also, the
branch misprediction penalty is around 30 cycles, while reading
from RAM can be over 200 cycles.

L1 code cache - you didn't mention this, but running those tasks
in SMT mode might thrash the L1 instruction cache, so execution
might not be optimal.
Rather than parallel reading of imports, what DMD needs is more
data-oriented data structures in place of its old OOP-inspired
ones. I'll give you an example of why that is the case.
Consider
struct {
    bool isAlive;
    <other data, at least 7 bytes in size>
}
If you want to read that bool, the CPU has to fetch the whole
64-byte cache line it sits in. What this means is that for one
byte of useful information the CPU fetches 64 bytes of data,
resulting in 1/64 = 0.015625, or ~1.6% signal to noise ratio.
This is terrible!
AFAIK DMD doesn't make this particular mistake, but it is full of
large structs and classes that are not efficient to read. To fix
this we need to split those large data structures into smaller
ones that only contain what a particular algorithm needs. I
predict a 2x speed improvement if we transform all the data
structures in DMD - and that is without improving any algorithms,
only changing data structures.
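To make that concrete, here is a minimal sketch of such a split -
hypothetical types I made up for illustration, not DMD's actual
ones:

// "Before": one fat struct, so scanning isAlive drags a whole
// 64-byte cache line into the CPU per object.
struct Entity
{
    bool isAlive;
    ubyte[63] otherData;   // cold data, pads the struct to 64 bytes
}

size_t countAlive(const Entity[] entities)
{
    size_t n;
    foreach (ref e; entities)
        if (e.isAlive)     // 1 useful byte out of 64 fetched
            ++n;
    return n;
}

// "After": the hot field gets its own dense array, so 64 flags
// now fit in a single cache line.
struct Entities
{
    bool[] isAlive;         // hot: the only thing this algorithm scans
    ubyte[63][] otherData;  // cold: touched only when actually needed
}

size_t countAlive(const ref Entities entities)
{
    size_t n;
    foreach (flag; entities.isAlive)
        if (flag)
            ++n;
    return n;
}

Scanning isAlive now reads 1 byte per object instead of 64 - the
1/64 ratio from above, turned around in our favor.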
This is getting too long, so I will stop right now.