xdc: A hypothetical D cross-compiler and AST manipulation tool.

Etienne etcimon at gmail.com
Fri Nov 8 20:46:08 PST 2013


Many vendors would have their processors supported in D if we had
a D-to-C compiler. I feel like it would be simpler than targeting
native code directly. Did this idea ever see any follow-through?

On Thursday, 18 July 2013 at 01:21:44 UTC, Chad Joan wrote:
> I'd like to present my vision for a new D compiler.  I call it 
> xdc, a loose abbreviation for "Cross D Compiler" (if confused, 
> see 
> http://english.stackexchange.com/questions/37394/why-do-some-words-have-x-as-a-substitute).
>  It could also mean other fun things like "Crossbone D Compiler"
> (I imagine a logo with some crossbones having a metal D atop where
> the skull normally goes), "eXperimental D Compiler", or something
> sounding like "ecstasy" ;)
>
> We usually think of language features as being rewritten into 
> simpler features.  The simple features eventually get rewritten 
> into machine instructions.  Compilers are, fundamentally, 
> responsible for performing "lowering" operations.
>
> It makes sense to me, then, to make a compiler whose internals 
> /look/ like a bunch of these rewrites and lowering operations.  
> There should be a bunch of input patterns matched to the 
> desired results.  This has the happy side-effect of giving us a 
> pleasant way to do AST manipulations from within D code.
>
> I've also had a long-standing desire to see D on many more 
> platforms.  It should make an appearance on console platforms 
> and on smartphones.  I've tried doing this with a retargetable 
> compiler like GCC before, and the work was surprisingly large.  
> Even if the compiler already emits code for the target system's 
> CPU, there are still a large number of details involving 
> calling conventions, executable format, and any number of 
> CPU/OS specific tweaks to object file output.  It makes a lot 
> more sense to me to just output C/C++ code and feed that to a 
> native toolchain.  That would skip a lot of the 
> platform-specific nonsense that creates a barrier to entry for 
> people who, say, just want to write a simple game for 
> Android/iPhone/PS(3|4)/etc in D, and don't want to become 
> compiler experts first.  Ideally, some day, this compiler would 
> also emit code or bytecode for Javascript, AS3/Flash, Java, and 
> any other popular targets that the market desires.  This can 
> probably be done with DMD, but I'd like to make the process 
> more approachable, and make backend authoring as easy as 
> possible.  It should be possible (and easy) to tell the 
> compiler exactly what lowerings should be applied before the 
> AST hits the backend.
>
> xdc should bring all of that cross-platform targeting together 
> with a compiler infrastructure that can blow everything else 
> away (I hope!).
>
> xdc is my dream for a D compiler that gives us our first step 
> (of few) towards having what haXe has already 
> (http://haxe.org/) : a compiler that can land code just about 
> anywhere.
>
> What follows is a collection of my thoughts written down as 
> notes.
>
> == Ideal Outcomes ==
>
> .- D to C/C++ compiler that can easily reach target platforms that
> .    are currently either unsupported or poorly supported by
> .    current D compilers.
> .   - Useful for game developers wishing to write D code on the
> .       next big console platform.
> .   - Useful for embedded systems developers wishing to write D
> .       code on obscure or potentially proprietary
> .       microcontrollers.
>
> .- Other backends (ex: llvm, Java bytecode, AS3/Flash bytecode,
> .    etc) possible in the future.  Community desires considered
> .    when selecting new backend targets.
>
> .- Interpreter backend: a notable backend that would be implemented
> .    as a necessity for making CTFE work.  A dedicated interpreter
> .    backend would hopefully be much faster and easier on memory
> .    than DMD's souped-up constant folder.  (Even though DMD has
> .    become considerably better at this in the past year or two.)
>
> .- Abstract Syntax Tree (AST) manipulation toolchain, possibly
> .    usable in CTFE.  It would work like this:
> .    (1) Pass some D code or Domain Specific Language (DSL) of
> .          your choice (as text) into xdc at compile-time.
> .    (2) xdc returns an AST.
> .    (3) Use xdc's pattern-matching and substitution DSL to
> .          manipulate the AST.
> .    (4) xdc consumes the AST and emits modified D code.
> .    (5) mixin(...) the result.
> .   - If xdc is the compiler used to execute itself in CTFE, then
> .       it might be possible to optimize this by having it expose
> .       itself as a set of intrinsics.
>
> .- Reference counting available by default on all platforms.
> .   - Gets you into the action with minimal effort and little or
> .       no compiler hacking.  (More complete GC tends to require
> .       platform specific ASM and/or operating system API support.)
>
> .- Full garbage collection available if supported.
> .   - Ex: The C backends would default to ref-counting until the
> .       ASM and OS-level code is written to support full GC.
> .   - Ex: A Java backend would probably use the JVM's garbage
> .       collector by default.
>
> .- Threading model determined by compiler configuration or
> .    available platform hints.
> .   - Ex: The user may have a posix-threads implementation
> .       available, but know few other details about the target
> .       system.  It should be possible for xdc to use pthreads to
> .       emulate the TLS and synchronization mechanisms needed to
> .       make D tick.  (Or at least emulate as many features as
> .       possible.)
> .   - Ex: Possible "no threading" target for people who don't need
> .       threading but DO need other D features to be available NOW
> .       on an alien platform.  The compiler would raise errors when
> .       the source code passed into xdc assumes that threading
> .       features are present.
>
> .- D compiler that is easy to hack on.
> .   - "Looks like the problem it solves."
> .       (To quote Walter's DConf2013 keynote.)
> .   - Made of a bunch of patterns that describe code
> .       rewrites/lowerings.
> .   - Few or no null value checks necessary.
> .      - null checks don't look like pattern matching or lowering.
> .   - Few or no convoluted if-while-if-for-etc nests.
> .      - These also don't look like pattern matching or lowering.
> .   - It should be largely made of "pattern handlers" (see below).
> .   - Each pattern handler will have one part that closely
> .       resembles the AST fragment for the D code that it
> .       recognizes, and another part that resembles the lowered
> .       form that it outputs.
> .   - Dependency analysis that prevents your AST manipulation from
> .       happening either too early or too late.
> .   - Because the code that actually does lowerings is generated
> .       from a DSL, it is possible to make it automate a lot of
> .       tedious tasks, like updating the symbol table when nodes
> .       are added or removed from the AST.
> .   - This makes it easier to test experimental features.
>
> .- A step-by-step view of what the compiler is doing to your code.
> .   - Since the semantic analysis of xdc would be composed of
> .      "pattern handlers" (see below), each time one of them
> .      completes the compiler could output the result of calling
> .      .toString() (or .toDCode() or whatever) on the entire AST.
> .   - This could be attached to an ncurses interface that would be
> .      activated by passing a flag to the compiler, which would
> .      then proceed to show the AST at every stage of compilation.
> .      Press ENTER to see the next step, etc.
> .   - This could also be exposed as API functionality that IDEs
> .      could use to show developers how the compiler sees their
> .      code.
>
> .- D code analysis engine that might be usable to automatically
> .    translate D1 code into D2 code, or maybe D2 into D3 in the
> .    far future.
>
> == Architectural Overview ==
>
> .- xdc will largely consist of "pattern handlers" that recognize
> .    patterns in its AST and replace them with AST fragments that
> .    contain successively fewer high-level features (lowering).
> .   - These pattern handlers would feature a DSL that should make
> .       the whole task fairly easy.
> .   - The DSL proposed would be similar to regular expressions in
> .       semantics but different in syntax.
> .      - It will have operators for choice, repetition, optional
> .          matches, capturing, and so on.
> .      - The DSL must support nested structures well.
> .      - The DSL must support vertical layout of patterns well.
> .      - Because of the vertical patterns, most operators will
> .          either be prefix or will be written in block style:
> .          some_block_header { block_stmt1; block_stmt2; etc; }
> .      - Actions like entering and leaving nodes are given their
> .          own special syntax.  The machine will treat them like
> .          tokens that can be matched the same as any AST node.
> .          Notably, node-entry and node-exit do not require
> .          introducing non-regular elements to the DSL.  node-entry
> .          and node-exit may be subsumed into Deterministic Finite
> .          Automatons (DFAs).
> .   - An example pattern handler might look like this:
>
> const lowerWhileStatement =
> {
> 	// Apologies in advance if this isn't actually valid D code:
> 	//   This is a design sketch and I currently don't have a way
> 	//   to compile it.
> 	//
> 	// The Pattern template, PatternMatch template, and the
> 	//   PatternHandler class have not yet been written.  This is
> 	//   an example of how I might expect them to be used.
> 	//
>
> 	auto consumes = ["while_statement"];
> 	auto produces = ["if_statement", "goto", "label"];
> 	
> 	auto recognizer = Pattern!
> 		"WhileStatement has
> 		{
> 			// Capture the conditional expression (call it \"expr\") and
> 			//   capture the loop body (call it \"statement\").
> 			.expression $expr;
> 			.statement  $statement has
> 			{
> 				// Capture any continue/break statements.
> 				any_amount_of {
> 					any_amount_of .; // Same as .* in regexes.
> 					one_of
> 					{
> 						ContinueStatement $continues;
> 						BreakStatement    $breaks;
> 					}
> 				}
> 				any_amount_of .;
> 			}
> 		}";
> 	
> 	auto action = (PatternMatch!(recognizer) m)
> 	{
> 		m.captures.add("uniqLoopAgain",
> 			getUniqueLabel(syntaxNode.enclosingScope));
> 		m.captures.add("uniqExitLoop",
> 			getUniqueLabel(syntaxNode.enclosingScope));
> 		
> 		// The "recognizes" clause defines m.getCapture!"continues"
> 		//   with: "ContinueStatement $continues;"
> 		// That line appears in a repetition context
> 		//   ("any_amount_of") and is therefore typed as an array.
> 		foreach( ref node; m.getCapture!"continues" )
> 			node.replaceWith( m, "GotoStatement has $uniqLoopAgain" );
> 		
> 		// Ditto for m.getCapture!"breaks" and
> 		//   "BreakStatement $breaks;".
> 		foreach( ref node; m.getCapture!"breaks" )
> 			node.replaceWith( m, "GotoStatement has $uniqExitLoop" );
> 	};
> 	
> 	auto synthesizer = Pattern!
> 		"Label has $uniqLoopAgain
> 		IfStatement has
> 		{
> 			OpNegate has $expr
> 			GotoStatement has $uniqExitLoop
> 		}
> 		$statement
> 		GotoStatement has $uniqLoopAgain
> 		Label has $uniqExitLoop
> 		";
>
> 	return new PatternHandler(produces, consumes, recognizer,
> 		action, synthesizer);
> };
>
> (Also available at: http://pastebin.com/0mBQxhLs )
>
> .- Dispatch to pattern handlers is performed by the execution of a
> .    DFA/Packrat hybrid instead of the traditional OOP inheritance
> .    with method calls.
> .   - Each pattern handler's recognizer gets treated like a regex
> .       or Parsing Expression Grammar (PEG) fragment.
> .   - All of the recognizers in the same semantic pass are pasted
> .       together in an ordered-choice expression.  The ordering is
> .       determined by dependency analysis.
> .   - A recognizer's pattern handler is invoked when the
> .       recognizer's AST expression is matched.
> .   - Once any substitutions are completed, then the machine
> .       executing the pattern engine will set its cursor to the
> .       beginning of the newly substituted AST nodes and continue
> .       running.
> .   - Executing potentially hundreds of pattern handlers in a
> .       single ordered-choice expression would be obnoxious for a
> .       packrat parser (packrat machine?).  Thankfully,
> .       ordered-choice is possible in regular grammars, so it can
> .       be lowered into regex operations and the whole thing turned
> .       into a DFA.
> .   - If pattern recognizers end up needing recursive elements,
> .       then they will probably not appear at the very beginning of
> .       the pattern.  Patterns with enough regular elements at the
> .       start will be able to merge those regular elements into the
> .       DFA with the rest of the pattern recognizers, and it all
> .       becomes very fast table lookups in small tables.
>
> .- This compiler would involve the creation of a parser-generator
> .    API that allows code to programmatically create grammars, and
> .    to do so without a bunch of clumsy string formatting and
> .    string concatenation.
> .   - These grammars could be written such that things like AST
> .       nodes are seen as terminals.  This expands possibilities
> .       and allows all of the pattern handlers to be coalesced into
> .       a grammar that operates on ASTs and fires off semantic
> .       actions whenever one of the recognizer patterns gets
> .       tickled by the right AST fragment.
> .   - Using strings as terminals is still cool, and necessary for
> .       xdc's text/D-code parser.
> .   - A simple parser-generator API example:
>
> ---------------------------------------
> string makeParser()
> {
> 	auto builder = new ParserBuilder!char;
> 	builder.pushSequence();
> 		builder.literal('x');
> 		builder.pushMaybe();
> 			builder.literal('y');
> 		builder.pop();
> 	builder.pop();
> 	return builder.toDCode("callMe");
> }
>
> const foo = makeParser();
>
> pragma(msg, foo);
> ---------------------------------------
> Current output:
> http://pastebin.com/V3E0Ubbc
> ---------------------------------------
>
> .   - Humans would probably never directly write grammars using
> .       this API; it is intended for use by code that needs to
> .       write grammars.  xdc would be such code: it's given a bunch
> .       of pattern handlers and needs to turn them into a grammar.
> .   - This API could also make it easier to write the parser
> .       generators that humans /would/ use.  For example, it could
> .       be used as an experimental backend for a regular expression
> .       engine that can handle limited recursion.
> .   - The packrats usually generated from PEGs are nice and all,
> .       but I'd really like to generate DFAs whenever possible,
> .       because those seem to be regarded as being /very fast/.
> .   - DFAs can't handle the recursive elements of PEGs, but they
> .       should be able to handle everything non-recursive that
> .       precedes or follows the recursive elements.
> .   - The parser-generator API would be responsible for
> .       aggressively converting PEG-like elements into regex/DFA
> .       elements whenever possible.
> .   - Regular expressions can be embedded in PEGs as long as you
> .       tell them how much text to match.  You have to give them
> .       concrete success/failure conditions that can be determined
> .       without help from the rest of the PEG: things like "match
> .       as many characters as possible" or "match as few
> .       characters as possible".  Without that, the regex's
> .       backtracking (DFA'd or otherwise) won't mesh with the PEG.
> .       Give it a concrete win/fail condition, however, and the
> .       embedded regex becomes just another PEG building block that
> .       chews through some source material and yields a yes/no
> .       result.  Such regular expressions allow DFAs to be
> .       introduced into a recursive descent or packrat parser.
> .   - Many PEG elements can be converted into these well-behaved
> .       regular expressions.
> .      - PEG repetition is just regular expression repetition with
> .          a wrapper around it that says "match as many characters
> .          as possible".
> .      - PEG ordered choice can be lowered into regular expression
> .          unordered choice, which can then be converted into DFAs.
> .          I suspect that this is true:
> .            (uv/xy)c == (uv|(^(uv)&xy))c
> .          (or, by De Morgan's law:
> .            (uv/xy)c == (uv|(^(uv|^(xy))))c )
> .          & is intersection.
> .          ^ is negation.
> .          Each letter (u,v,x,y,c) can be a subexpression
> .            (non-recursive).
> .      - PEG label matching can be inlined up to the point where
> .          recursion occurs, thus allowing more elements to be
> .          considered for DFA conversion.
> .      - etc.
>
> .- The parser would be defined using a PEG (most likely using
> .    Pegged specifically).
> .   - Although Pegged is an awesome achievement, I suspect its
> .       output could be improved considerably.  The templated code
> .       it generates is slow to compile and ALWAYS allocates parse
> .       tree nodes at every symbol.
> .   - I want to experiment with making Pegged (or a branch of it)
> .       emit DFA/Packrat parser hybrids.  This could be done by
> .       making a version of Pegged that uses the aforementioned
> .       parser-generator API to create its parsers.
> .   - Design principle: avoid memory allocations like the plague.
> .       The output should be a well-pruned AST, and not just a
> .       parse tree that causes a bunch of allocations and needs
> .       massaging to become useful.
> .   - I really like Pegged and would contribute this stuff
> .       upstream, if accepted.
>
> .- When hacking on xdc, you don't need to be aware of WHEN your
> .    code gets executed in semantic analysis.  The dependency
> .    analysis will guarantee that it always gets performed both
> .    (a) when it's needed, and (b) when it has what it needs.
> .   - This is what the "consumes" and "produces" variables are all
> .       about in the above example.
>
> .- Successfully lowering a D AST into the target backend's input
> .    will almost certainly require multiple passes.  xdc's
> .    dependency analyzer would automatically minimize the number of
> .    passes by looking for patterns that are "siblings" in the
> .    dependency graph (e.g. neither depends on the other) and
> .    bunching as many such patterns as possible into each pass.
> .   - It really shouldn't generate many more passes than DMD has
> .       coded into it.  Ideally: no more than DMD, if not fewer.
> .   - I'd like to make the dependency analyzer output a graph that
> .       can be used to track which patterns cause which passes to
> .       exist, and show which patterns are in which passes.
>
> .- Planned availability of backends.
> .   - My first pick for a backend would be an ANSI C89 target.  I
> .       feel that this would give it the most reach.
> .   - The interpreter backend is along for the ride, as mentioned.
> .   - Because the semantic analysis is composed of distinct and
> .       loosely-coupled patterns, it is possible for xdc to
> .       generate an analysis chain with the minimum number of
> .       lowerings needed for a given backend.
> .      - The interpreter backend would benefit from having the
> .          most lowerings.  By requiring a lot of lowering, the
> .          interpreter would only need to support a small number of
> .          constructs:
> .         - if statements
> .         - gotos
> .         - function calls
> .         - arithmetic expression evaluation
> .         - builtin types (byte, short, int, long, float, double,
> .             etc)
> .         - pointers
> .         - Even structs are unnecessary: they can be seen as
> .             typed dereferencing of untyped pointers.
> .      - The C backend would benefit from slightly less lowering
> .          than the interpreter backend.  It is useful for
> .          debugging if you can mostly-sorta read the resulting C
> .          code, and your C compiler will appreciate the extra
> .          optimization opportunities.
> .         - Looping constructs like while and for are welcome here.
> .         - structs would be more readable.
> .      - In the future, a Java or C# backend might use an entirely
> .          different set of lowerings in later passes.
> .         - Pointers are no longer considered "low".
> .         - Classes should be kept as long as possible; I'm pretty
> .             sure the bytecode (at least for Java) has opcodes
> .             dedicated to classes.  Removing them may cause
> .             pessimization.
> .      - The backend writer should not have to worry about
> .          rewriting the semantic analysis to suit their needs.
> .          They just define some features and say which ones they
> .          need available in the AST, and xdc's
> .          semantic-analysis-generator will handle the rest.
> .   - Notably, a backend should just be more lowerings, with the
> .       result being text or binary code instead of AST nodes.
> .      - Backends are essentially defined by the set of
> .          AST/language features that they consume and any special
> .          lowerings needed to convert generic AST/language
> .          features into backend-specific AST/language features.
>
>
> == Closing Thoughts ==
>
> I am realizing that there are multiple reasons that compel me 
> to write this document:
> - To share my ideas with others, on the off-chance that someone 
> else might see this vision too and be better equipped to 
> deliver.
> - To suggest capabilities that any community-endorsed compiler 
> tool (ex: compiler-as-a-ctfe-library) should have.
> - To see if I might be able to get the help I need to make it a 
> reality.
>
> I just can't decide which reasons are more important.  But 
> there is a common thread: I want this vision to become reality 
> and do really cool things while filling a bunch of missing 
> links in D's ecosystem.
>
> I have to ask:
>
> Would you pay for this?
> If so, then I might be able to do a kickstarter at some point.
> I am not independently wealthy or retired (or both?) like 
> Walter, nor am I able to survive on zero hours of sleep each 
> night like Andrei, and this would be a big project.  I think it 
> would need full-time attention or it would never become useful 
> in a reasonable timeframe.
>
> Also, assuming you understand the design, are there any gaping 
> holes in this?
> This is my first attempt to share these ideas with a larger 
> group, and thus an opportunity to anticipate troubles.
>
> ...
>
> Well, I'm anxious to see how well the venerable D community 
> receives this bundle of ideas.  Be chatty.  I'll try to keep up.
>
> Thank you for reading.

