Compilation strategy

foobar foo at bar.com
Tue Dec 18 03:00:07 PST 2012


On Tuesday, 18 December 2012 at 00:15:04 UTC, H. S. Teoh wrote:
> On Tue, Dec 18, 2012 at 02:08:55AM +0400, Dmitry Olshansky wrote:
> [...]
>> I suspect it's one of the prime examples where the UNIX philosophy of
>> combining a bunch of simple (~ dumb) programs together in place of one
>> more complex program was taken *far* beyond reasonable lengths.
>>
>> Having a pipe-line:
>> preprocessor -> compiler -> (still?) assembler -> linker
>>
>> where every program tries hard to know nothing about the previous ones
>> (and be as simple as it possibly can be) is bound to get inadequate
>> results on many fronts:
>> - efficiency & scalability
>> - cross-border error reporting and detection (linker errors? errors
>>   for expanded macro magic?)
>> - cross-file manipulations (e.g. optimization, see _how_ LTO is done
>>   in GCC)
>> - multiple problems from a loss of information across the pipeline*
>
> The problem is not so much the structure preprocessor -> compiler ->
> assembler -> linker; the problem is that these logical stages have been
> arbitrarily assigned to individual processes residing in their own
> address space, communicating via files (or pipes, whatever it may be).
>
> The fact that they are separate processes is in itself not that big of
> a problem, but the fact that they reside in their own address space is
> a big problem, because you cannot pass any information down the chain
> except through rudimentary OS interfaces like files and pipes. Even
> that wouldn't have been so bad, if it weren't for the fact that the
> user interface (in the form of text input / object file format) has
> also been conflated with the program interface (the compiler has to
> produce the input to the assembler, in *text*, and the assembler has to
> produce object files that do not encode any direct dependency
> information, because that's the standard file format the linker
> expects).
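
Concretely, driving those stages by hand looks roughly like this (a
sketch: it leans on the gcc driver for the preprocess and link steps,
and plain files are the only interface between the stages):

	import std.process : execute;

	void main()
	{
		// Each stage is a separate process; the only thing that survives
		// from one stage to the next is a file on disk.
		execute(["gcc", "-E", "foo.c", "-o", "foo.i"]); // preprocessor -> text
		execute(["gcc", "-S", "foo.i", "-o", "foo.s"]); // compiler -> asm text
		execute(["as", "foo.s", "-o", "foo.o"]);        // assembler -> object file
		execute(["gcc", "foo.o", "-o", "foo"]);         // linker, via the driver
	}

By the time the assembler and linker see their inputs, everything the
compiler knew about types and inter-module dependencies is gone.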
>
> Now consider if we keep the same stages, but each stage is not a
> separate program but a *library*. The code then might look, in
> greatly simplified form, something like this:
>
> 	import libdmd.compiler;
> 	import libdmd.assembler;
> 	import libdmd.linker;
>
> 	void main(string[] args) {
> 		// typeof(asmCode) is some arbitrarily complex data
> 		// structure encoding assembly code, inter-module
> 		// dependencies, etc.
> 		auto asmCode = compiler.lex(args)
> 			.parse()
> 			.optimize()
> 			.codegen();
>
> 		// Note: no stupid redundant convert to string, parse,
> 		// convert back to internal representation.
> 		auto objectCode = assembler.assemble(asmCode);
>
> 		// Note: linker has direct access to dependency info,
> 		// etc., carried over from asmCode -> objectCode.
> 		auto executable = linker.link(objectCode);
> 		auto output = File(outfile, "w");
> 		executable.generate(output);
> 	}
>
> Note that the types asmCode, objectCode, executable, are arbitrarily
> complex, and may contain lazy-evaluated data structures, references to
> on-disk temporary storage (for large projects you can't hold everything
> in RAM), etc. Dependency information in asmCode is propagated to
> objectCode, as necessary. The linker has full access to all info the
> compiler has access to, and can perform inter-module optimization,
> etc., by accessing information available to the *compiler* front-end,
> not just some crippled object file format.
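
A rough sketch of what such a stage output might carry (hypothetical
types, not the actual dmd internals):

	// The assembler/linker stages receive real data structures,
	// not a flattened text or object-file rendering of them.
	struct AsmFunction
	{
		string   mangledName;
		string[] calledSymbols; // per-symbol dependencies survive, so the
		                        // linker can drop unused functions individually
		ubyte[]  code;          // or a lazy range / on-disk handle for big builds
	}

	struct AsmModule
	{
		string        name;
		string[]      importedModules; // inter-module dependency info
		AsmFunction[] functions;
	}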
>
> The root of the current nonsense is that perfectly-fine data structures
> are arbitrarily required to be flattened into some kind of intermediate
> form, written to some file (or sent down some pipe), often with loss of
> information, then read from the other end, interpreted, and
> reconstituted into other data structures (with incomplete info), then
> processed. In many cases, information that didn't make it through the
> channel has to be reconstructed (often imperfectly), and then used.
> Most of these steps are redundant. If the compiler data structures were
> already directly available in the first place, none of this baroque
> dance is necessary.
>
>
>> *Semantic info on the interdependency of symbols in a source file is
>> destroyed right before the linker, and thus each .obj file is included
>> as a whole or not at all. Thus all C run-times I've seen _sidestep_
>> this by writing each function in its own file(!). Even this alone
>> should have been a clear indication.
>>
>> While simplicity (and correspondingly size in memory) of programs was
>> king in the 70's, it's well past due. Nowadays I think it's all about
>> getting the highest throughput and more powerful features.
> [...]
>
> Simplicity is good. Simplicity lets you modularize a very complex piece
> of software (a compiler that converts D source code into executables)
> into manageable chunks. Simplicity does not require shoe-horning
> modules into separate programs with separate address spaces with
> separate (and deficient) input/output formats.
>
> The problem isn't with simplicity; the problem is with carrying over
> the archaic mapping of compilation stage -> separate program. I mean,
> imagine if std.regex were written so that regex compilation runs in a
> separate program with a separate address space, and the regex matcher
> that executes the match runs in another separate program with a
> separate address space, and the two talk to each other via pipes, or
> worse, intermediate files.
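
For comparison, std.regex really is the in-process, library version of
that split: the pattern is "compiled" into an object that the matcher
then consumes directly:

	import std.regex;
	import std.stdio;

	void main()
	{
		// Stage 1: compile the pattern into an internal representation.
		auto re = regex(r"[0-9]+");

		// Stage 2: the matcher uses that representation as-is -- no
		// serializing it to text and re-parsing it in another process.
		auto m = matchFirst("error on line 42", re);
		if (!m.empty)
			writeln("matched: ", m.hit);
	}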
>
> I've mentioned a few times before a horrendous C++ project that I had
> to work with once, where to make a single function call to a particular
> subsystem, it had to go through 6 layers of abstraction, one of which
> was IPC through a local UNIX socket, *and* another of which involved
> fwrite()ing function parameters into a file and fread()ing said
> parameters from the file in another process, with the 6 layers
> repeating in reverse to propagate the return value of the function back
> to the caller.
>
> In the new version of said project, that subsystem exposes a library
> API where to make a function call, you, um, just call the function
> (gee, what a concept). Needless to say, it didn't take a lot of effort
> to convince customers to upgrade, upon which we proceeded with great
> relish to delete every single source file having to do with that
> 6-layered monstrosity, and had a celebration afterwards.
>
> From the design POV, though, the layout of the old version of the
> project utterly made sense. It was superbly (over)engineered, and if
> you made UML diagrams of it, they would be works of art fit for the
> British Museum. The implementation, however, was "somewhat"
> disappointing.
>
>
> T

IMO, it's not even an issue of the separate address spaces. The core
problem is a direct result of relying on *archaic file formats*. Simply
serializing the intermediate data structures already solves the
data-loss problem; all that remains are questions of efficiency, which
matter much less given current compilation speeds. Separate address
spaces can even be useful if we add distributed and concurrent aspects
into the mix.
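
To sketch what I mean (hypothetical types, with std.json standing in
for whatever format the stages would agree on), even a naive
serialization keeps the dependency information intact across the
process boundary:

	import std.file : readText, write;
	import std.json;
	import std.stdio : writeln;

	// Hypothetical per-symbol record the compiler stage would emit.
	struct Symbol
	{
		string   name;
		string[] dependsOn;
	}

	void main()
	{
		auto symbols = [Symbol("main", ["parseArgs", "compile"]),
		                Symbol("parseArgs", ["strtol"])];

		// Producer side: flatten the structure without losing anything.
		JSONValue[] nodes;
		foreach (s; symbols)
			nodes ~= JSONValue(["name": JSONValue(s.name),
			                    "dependsOn": JSONValue(s.dependsOn)]);
		write("ir.json", JSONValue(nodes).toString());

		// Consumer side (possibly another process): everything the
		// producer knew is still there.
		foreach (node; parseJSON(readText("ir.json")).array)
			writeln(node["name"].str, " depends on ", node["dependsOn"]);
	}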

