How to get to a class initializer through introspection?

Johan j at j.nl
Wed Aug 5 22:19:11 UTC 2020


On Wednesday, 5 August 2020 at 16:08:59 UTC, Johannes Pfau wrote:
> Am Wed, 05 Aug 2020 14:36:37 +0000 schrieb Johan:
>
>> On Wednesday, 5 August 2020 at 13:40:16 UTC, Johannes Pfau 
>> wrote:
>>>
>>> I'd therefore suggest the following:
>>> 1) Make all init symbols COMDAT: This ensures that if a 
>>> symbol is
>>> actually needed (address taken, real memcpy call) it will be 
>>> available.
>>> But if it is not needed, the compiler does not have to output 
>>> the
>>> symbol.
>>> If it's required in multiple files, COMDAT will merge the 
>>> symbols into
>>> one.
>>>
>>> 2) Ensure the compiler always knows the data of that symbol. 
>>> This probably means during codegen, the initializer should 
>>> never be an external symbol. It needs to be a COMDAT symbol 
>>> with attached initializer expression. And the initializer 
>>> data must always be fully available in .di files.
>>>
>>> The two rules combined should allow the backend to choose the 
>>> initialization method that is most appropriate for the target 
>>> architecture.
>> 
>> What you are suggesting is pretty much exactly what the 
>> compilers already do. Except that we don't expose the 
>> initialization symbol directly to the user (T.init is an 
>> rvalue, and does not point to the initialization symbol), but 
>> through TypeInfo.initializer. Not exposing the initializer 
>> symbol to the user had a nice benefit: for cases where we 
>> never want to emit an initializer symbol (very large structs), 
>> we simply removed that symbol and started doing something else 
>> (memset zero), without breaking any user code. However this 
>> only works for all-zero structs, because TypeInfo.initializer 
>> must return a slice ({init symbol, length}) to data or 
>> {null, length} for all-zero (the 'null' is what we started 
>> making use of). More complex cases cannot elide the symbol.
>> 
>> Initializer functions would allow us to tailor initialization 
>> for more complex cases (e.g. with =void holes, padding 
>> shenanigans, or non-zero-but-repetitive-constant 
>> double[1million] arrays, ...), without having to always 
>> turn-on some backend optimizations (at -O0) and without having 
>> to expose a TypeInfo.initializer slice, but instead exposing a 
>> TypeInfo.initializer function pointer.
>> 
>> -Johan
>
> But initializer symbols are currently not in COMDAT, or does 
> LDC implement that? That's a crucial point, as it addresses 
> Andrei's initializer bloat point. And it also means you can 
> avoid emitting the symbol if it's never referenced. But if it 
> is referenced, it will be available.

It does not matter whether the initializer symbol is in COMDAT, 
because (currently) it has to be dynamically accessible (e.g. by 
a user of a compiled library, or by druntime's GC object destroy 
code), so it cannot be determined at compile/link time whether 
the symbol is referenced.

> Initializer functions have the drawback that backends can no 
> longer choose different strategies for -Os or -O2. All the 
> other benefits you mention (=void holes, padding shenanigans, 
> or non-zero-but-repetitive-constant double[1million] arrays, 
> ...) can also be handled properly by the backend in the 
> initializer-symbol case if the initializer expression is 
> available to the backend. And you have to ensure that the 
> initialization function can always be inlined, so without -O 
> flags it may also lead to suboptimal code...

Backends can also turn an initializer function into a memcpy 
function.
It's perfectly fine if code is suboptimal without -O.
You can simply express more with a function than with a symbol (a 
symbol implies the function "memcpy(all)", whereas a function 
could do that and more).
How would you express =void using a symbol in an object file?

> If the initializer optimizations depend on -O flags, it should 
> also be possible to move the necessary steps in the backend 
> into a different step which is executed even without 
> optimization flags. Choosing to initialize using expressions 
> vs. a symbol should not be an expensive step.

Actually, this does sound like an expensive analysis to me (e.g. 
detecting the case of a large array with repetitive 
initialization inside a struct with a few other members). But 
maybe more practically: is it possible to enable/disable specific 
optimization passes for individual functions with the GCC backend 
at -O0? (We can't with LLVM.)

> I don't see how an initializer function would be more flexible 
> than that. In fact, you could generate the initializer function 
> in the backend if information about the initialization 
> expression is always preserved. Constructing an initializer 
> function earlier (in the frontend, or D user code) removes 
> information about the target architecture (-Os, memory 
> available, efficient addressing of local constant data, ...). 
> Because of that, I think the backend is the best place to 
> implement this and the frontend should just provide the symbol 
> initializer expression.

I'm a little confused, because your last sentence is exactly what 
we currently do, with the terminology: frontend = the dmd code 
that outputs a semantically analyzed AST; backend = DMD/GCC/LLVM 
codegen, possibly with a "glue layer intermediate representation" 
in-between.
What I thought was being discussed in this thread is moving the 
complexity out of the compilers (so out of the current backends) 
into druntime. For that, I think an initializer function is a 
good solution (similar to emitting a constructor function, rather 
than implementing that codegen inside the backend).

-Johan



More information about the Digitalmars-d mailing list