[Issue 21890] New: Memory layout and access patterns in the backend's COFF implementation lead to apocalyptically terrible LLC misses.

Mon May 3 17:09:59 UTC 2021

https://issues.dlang.org/show_bug.cgi?id=21890

          Issue ID: 21890
           Summary: Memory layout and access patterns in the backend's
                    COFF implementation lead to apocalyptically terrible
                    LLC misses.
           Product: D
           Version: D2
          Hardware: x86
                OS: Windows
            Status: NEW
          Keywords: backend, performance
          Severity: normal
          Priority: P1
         Component: dmd
          Assignee: nobody at puremagic.com
          Reporter: maxhaton at gmail.com

TL;DR dmd doesn't utilize the caches very well at all, this makes it slow.

I just pointed Intel's vTune profiler at dmd compiling itself, turns out that
out of a ~4 seconds, the hotspots are

Top Hotspots:

Function                                 Module                 CPU Time
MsCoffObj_getsegment                     dmdforprofilegc.exe    0.626s
dmd.func.traverseIndirections.traverse   dmdforprofilegc.exe    0.245s
ScopeDsymbol::search                     dmdforprofilegc.exe    0.146s
Type::addMod                             dmdforprofilegc.exe    0.086s
func at 0x18002b650                         ntdll.dll              0.085s
[Others]                                 N/A*                   3.261s

Something is going wrong with MsCoffObj_getsegment. Further investigation
suggests almost all the measured direct-to-DRAM hits happen only in this
function.

Function / Call Stack   CPU Time        Memory Bound  Loads        Stores     
LLC Miss Count  Average Latency (cycles)
MsCoffObj_getsegment    626.000ms       70.0%         397,211,916  1,200,036  
32,402,268      49

Going deeper the offending line seems to be the following:
------------------------------
@trusted
segidx_t MsCoffObj_getsegment(const(char)* sectname, uint flags)
{
    //printf("getsegment(%s)\n", sectname);
    assert(strlen(sectname) <= 8);      // so it won't go into string_table
    if (!(flags & IMAGE_SCN_LNK_COMDAT))
    {
        for (segidx_t seg = 1; seg < SegData.length; seg++)
        {   seg_data *pseg = SegData[seg];
            if (!(ScnhdrTab[pseg.SDshtidx].Characteristics &
IMAGE_SCN_LNK_COMDAT) && <---- THIS LINE HERE
                strncmp(cast(const(char)* )ScnhdrTab[pseg.SDshtidx].Name,
sectname, 8) == 0)
            {
                //printf("\t%s\n", sectname);
                return seg;         // return existing segment
            }
        }
    }

    segidx_t seg = MsCoffObj_getsegment2(MsCoffObj_addScnhdr(sectname, flags));

    //printf("\tSegData.length = %d\n", SegData.length);
    //printf("\tseg = %d, %d, %s\n", seg, SegData[seg].SDshtidx,
ScnhdrTab[SegData[seg].SDshtidx].s_name);
    return seg;
}
------------------------------

Disclaimer: vTune has a habit of bumping performance counter data into the next
instructions total, so these cache misses could be from the SegData access also
(I will investigate further).

Although the sheer volume may be an issue in itself (I haven't instrumented
this yet), this code is a n-tuple threat:

- SegData's struct is 140 bytes in size, and SDshtidx is at index 64, so the
cache efficiency is not good.
- This is a memory dependency chain which isn't ideal.
- This assessed value is then used as an index into ScnhdrTab, which is an
array of structs of size 40, whose last element is then indexed. This means
that due to the allocator alignment it is quite likely that only one
.Characteristics can actually fit into one cacheline.

Solution: Switch the data structure/s from Array of struct to struct of arrays.
This could yield somewhere on the order of 4x to 16x speedup if my data and
hypothesis are correct. I've given the rest of the code a skim and it seems
like this change shouldn't pessimize the rest of the code, as the buffers are
mostly accessed sparsely (most accesses are only to 1 member at a time), the
actual code itself should be able to stay the same shape thanks to D's usual
bag of tricks. 

Also known as https://en.wikipedia.org/wiki/Data-oriented_design

Additional Information:
-- Profiled build was the result of "ldc2 -O3", so no PGO or LTO.
-- Further investigation of the length of SegData yields a number of roughly
59000 by the end of compilation, however this function seems to be called a lot
also so there I think the combination of this loop and a loop in a caller is
yielding Ω(n^2) runtime.

--