[Issue 21890] New: Memory layout and access patterns in the backend's COFF implementation lead to apocalyptically terrible LLC misses.
d-bugmail at puremagic.com
d-bugmail at puremagic.com
Mon May 3 17:09:59 UTC 2021
https://issues.dlang.org/show_bug.cgi?id=21890
Issue ID: 21890
Summary: Memory layout and access patterns in the backend's
COFF implementation lead to apocalyptically terrible
LLC misses.
Product: D
Version: D2
Hardware: x86
OS: Windows
Status: NEW
Keywords: backend, performance
Severity: normal
Priority: P1
Component: dmd
Assignee: nobody at puremagic.com
Reporter: maxhaton at gmail.com
TL;DR dmd doesn't utilize the caches very well at all, this makes it slow.
I just pointed Intel's vTune profiler at dmd compiling itself, turns out that
out of a ~4 seconds, the hotspots are
Top Hotspots:
Function Module CPU Time
MsCoffObj_getsegment dmdforprofilegc.exe 0.626s
dmd.func.traverseIndirections.traverse dmdforprofilegc.exe 0.245s
ScopeDsymbol::search dmdforprofilegc.exe 0.146s
Type::addMod dmdforprofilegc.exe 0.086s
func at 0x18002b650 ntdll.dll 0.085s
[Others] N/A* 3.261s
Something is going wrong with MsCoffObj_getsegment. Further investigation
suggests almost all the measured direct-to-DRAM hits happen only in this
function.
Function / Call Stack CPU Time Memory Bound Loads Stores
LLC Miss Count Average Latency (cycles)
MsCoffObj_getsegment 626.000ms 70.0% 397,211,916 1,200,036
32,402,268 49
Going deeper the offending line seems to be the following:
------------------------------
@trusted
segidx_t MsCoffObj_getsegment(const(char)* sectname, uint flags)
{
//printf("getsegment(%s)\n", sectname);
assert(strlen(sectname) <= 8); // so it won't go into string_table
if (!(flags & IMAGE_SCN_LNK_COMDAT))
{
for (segidx_t seg = 1; seg < SegData.length; seg++)
{ seg_data *pseg = SegData[seg];
if (!(ScnhdrTab[pseg.SDshtidx].Characteristics &
IMAGE_SCN_LNK_COMDAT) && <---- THIS LINE HERE
strncmp(cast(const(char)* )ScnhdrTab[pseg.SDshtidx].Name,
sectname, 8) == 0)
{
//printf("\t%s\n", sectname);
return seg; // return existing segment
}
}
}
segidx_t seg = MsCoffObj_getsegment2(MsCoffObj_addScnhdr(sectname, flags));
//printf("\tSegData.length = %d\n", SegData.length);
//printf("\tseg = %d, %d, %s\n", seg, SegData[seg].SDshtidx,
ScnhdrTab[SegData[seg].SDshtidx].s_name);
return seg;
}
------------------------------
Disclaimer: vTune has a habit of bumping performance counter data into the next
instructions total, so these cache misses could be from the SegData access also
(I will investigate further).
Although the sheer volume may be an issue in itself (I haven't instrumented
this yet), this code is a n-tuple threat:
- SegData's struct is 140 bytes in size, and SDshtidx is at index 64, so the
cache efficiency is not good.
- This is a memory dependency chain which isn't ideal.
- This assessed value is then used as an index into ScnhdrTab, which is an
array of structs of size 40, whose last element is then indexed. This means
that due to the allocator alignment it is quite likely that only one
.Characteristics can actually fit into one cacheline.
Solution: Switch the data structure/s from Array of struct to struct of arrays.
This could yield somewhere on the order of 4x to 16x speedup if my data and
hypothesis are correct. I've given the rest of the code a skim and it seems
like this change shouldn't pessimize the rest of the code, as the buffers are
mostly accessed sparsely (most accesses are only to 1 member at a time), the
actual code itself should be able to stay the same shape thanks to D's usual
bag of tricks.
Also known as https://en.wikipedia.org/wiki/Data-oriented_design
Additional Information:
-- Profiled build was the result of "ldc2 -O3", so no PGO or LTO.
-- Further investigation of the length of SegData yields a number of roughly
59000 by the end of compilation, however this function seems to be called a lot
also so there I think the combination of this loop and a loop in a caller is
yielding Ω(n^2) runtime.
--
More information about the Digitalmars-d-bugs
mailing list