Let's stop parser Hell

David Piepgrass qwertie256 at gmail.com
Sat Jul 7 13:26:06 PDT 2012


> 	auto captures = syntaxNode.matchNodes(
> 		TOK_WHILE_NODE,
> 		OP_ENTER_NODE,
> 			OP_CAPTURE(0),
> 			OP_BEGIN,
> 				TOK_EXPRESSION,
> 			OP_END,
> 			OP_CAPTURE(1),
> 			OP_BEGIN,
> 				TOK_STATEMENT,
> 			OP_END,
> 		OP_LEAVE_NODE);

I'm glad to hear you like the tree-parsing approach, Chad, 
although the particular syntax here looks pretty unfriendly :O -- 
does this represent something that you are working on right now?

> This kind of architecture leads to other interesting benefits, 
> like being able to assert which symbols a pattern is designed 
> to handle or which symbols are allowed to exist in the AST at 
> any point in time. Thus if you write a lowering that introduces 
> nodes that a later pass can't handle, you'll know very quickly, 
> at least in principle.
>
> I wanted to make such a front-end so that I could easily make a 
> C backend.  I believe such a compiler would be able to do that 
> with great ease.  I really want a D compiler that can output 
> ANSI C code that can be used with few or no OS/CPU 
> dependencies.  I would be willing to lose a lot of the nifty 
> parallelism/concurrency stuff and deal with reference counting 
> instead of full garbage collection, as long as it lets me 
> EASILY target new systems (any phone, console platform, and 
> some embedded microcontrollers).  Then what I have is something 
> that's as ubiquitous as C, but adds a lot of useful features 
> like exception handling, dynamic arrays, templates, CTFE, etc 
> etc.  My ideas for how to deal with ASTs in pattern recognition 
> and substitution followed from this.

I tend to agree that it would be better to have a "general" node 
class, with the node type as a property, rather than encoding 
each kind of node as its own subtype or as one of a myriad of 
independent types. In the past, though, I haven't been able to 
figure out how to make this approach simultaneously general, 
efficient, and easy to use. I'm kind of a perfectionist, which 
perhaps holds me back sometimes :)
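
To be concrete, here is roughly the shape I have in mind. This is 
only a minimal sketch, with made-up names, and it glosses over 
exactly the part I find hard (making it general, efficient, AND 
convenient all at once):

    // One "general" node class; the kind of node is a property
    // (an enum value), not a subtype. All names are hypothetical.
    enum NodeType { Ident, Literal, Call, While, Block /* ... */ }

    class Node
    {
        NodeType type;      // what kind of node this is
        string   text;      // token text, if any (identifier, etc.)
        Node[]   children;  // sub-nodes, in source order

        this(NodeType type, string text = null, Node[] children = null)
        {
            this.type = type;
            this.text = text;
            this.children = children;
        }
    }

A pattern-matching API like the matchNodes() call quoted above 
would then dispatch on the type property instead of doing dynamic 
casts against a zoo of subclasses.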

I'd like to add that if we give tree parsing first-class 
treatment, I believe the most logical approach to parsing has 
three or more stages instead of the traditional two (lex+parse):

1. Lexer
2. Tree-ification
3. Parsing to AST (which may itself use multiple stages, e.g. 
parse the declarations first, then parse function bodies later)

The new stage 2 simply groups together things that are enclosed 
in parentheses, brackets, and braces. So an input stream such as 
the following:

A man (from a [very ugly] house in the suburbs) was quoted as 
saying {
     I saw Batman (and Robin) last night!
}

is converted to a tree where everything parenthesized, 
bracketed, or braced becomes a child:

A man (
    from a [
        very ugly
    ] house in the suburbs
) was quoted as saying {
    ...
}
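
In code, stage 2 amounts to little more than a bracket matcher. 
Here's a minimal sketch of the idea (all names hypothetical; it 
assumes stage 1 hands us a flat array of token strings, and a 
real version would report mismatched brackets rather than 
silently tolerating them):

    // Hypothetical stage-2 "tree-ification": group a flat token
    // stream into a tree by matching (), [] and {}.
    struct TreeNode
    {
        string     text;     // the token, or the opener "(", "[", "{"
        TreeNode[] children; // everything between opener and closer
    }

    TreeNode[] treeify(string[] tokens, ref size_t i,
                       string closer = null)
    {
        TreeNode[] nodes;
        while (i < tokens.length)
        {
            string t = tokens[i++];
            if (t == closer)
                return nodes; // this group is complete
            switch (t)
            {
                case "(": nodes ~= TreeNode(t, treeify(tokens, i, ")")); break;
                case "[": nodes ~= TreeNode(t, treeify(tokens, i, "]")); break;
                case "{": nodes ~= TreeNode(t, treeify(tokens, i, "}")); break;
                default:  nodes ~= TreeNode(t); break;
            }
        }
        return nodes; // reaching here with a non-null closer = error
    }

    // Usage: size_t i = 0; auto tree = treeify(tokens, i);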

Some of the things I like about this approach are:

1. It's language-agnostic. Lots of languages and DSLs could 
re-use exactly the same code from stage 2. (Stage 1, also, is 
fairly similar between languages and I wonder if a parameterized 
standard lexer is a worthwhile pursuit.)

2. It mostly eliminates the need for arbitrary-length lookahead 
for things like D's template_functions(...)(...). Of course, the 
tokens will almost always end up getting scanned twice, but hey, 
at least you know you won't need to scan them more than twice, 
right? (er, of course the semantic analysis will scan them 
several times anyway. Maybe this point is moot.)

3. It is very efficient for tools that don't need to examine 
function bodies. Such tools can easily leave out that part of the 
parser simply by not invoking the function-body sub-parser (see 
the sketch after this list).

4. It leaves open the door to supporting embedded DSLs in the 
future (also sketched below). It's trivial to just ignore a block 
of text in braces and hand it off to a DSL later. It is similar 
to the way PEGs allow several different parties to contribute 
parts of a grammar, except that this approach does not constrain 
all the parties to actually use PEGs; for instance, if I am a 
really lazy DSL author and I already have a SQL parser lying 
around (whether it's LL(k), LALR, whatever), I can just feed the 
original input text to that parser (or, better, the flat token 
stream, sans comments, that came out of the lexer).

5. It's risky 'cause I've never heard of anyone taking this 
approach before. Bring on the danger!
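
Here are quick sketches of points 3 and 4, building on the 
hypothetical TreeNode above (again, just illustrating the idea, 
not a real design):

    // Point 3: a declarations-only tool walks the top level of
    // the tree and never recurses into "{" nodes, so function
    // bodies cost nothing to skip.
    void scanDeclarations(TreeNode[] nodes)
    {
        foreach (node; nodes)
        {
            if (node.text == "{")
                continue; // a body (or DSL block): skip it wholesale
            // ...inspect declaration tokens; recurse into "("/"["
            // groups only where the grammar needs it...
        }
    }

    // Point 4: an embedded DSL is just a "{" node whose children
    // (or raw text) we hand to someone else's parser, whatever
    // parsing technology it happens to use.
    void handOffDsl(TreeNode dslBlock,
                    void delegate(TreeNode[]) dslParser)
    {
        assert(dslBlock.text == "{");
        dslParser(dslBlock.children); // the DSL never sees our grammar
    }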

I have observed that most PLs (Programming Langs) use one of two 
versions of stage 2: (1) C-style, with structure indicated 
entirely with {}, (), [], and possibly <> (shudder), or (2) 
Python-style, with structure indicated by indentation instead of 
{}. My favorite is the Boo language, which combines the two: it 
uses Python style by default, but also has a whitespace-agnostic 
(WSA) parsing mode with braces, and it switches to WSA mode 
inside a Python-style module whenever the user uses an opener 
("(", "{", or "[") (IIRC). So a single Boo-like stage 2 could be 
re-used for almost all PLs, and thus would be a reasonable 
candidate for the standard library or a "D Boost library" (there 
is no such thing, is there?).
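
To serve both styles with one stage 2, the lexer can synthesize 
brace tokens from indentation changes, so the brace-matching 
tree-ification above never has to know which style the source 
used. A rough sketch of the idea (this is NOT how Boo actually 
implements it, and it omits the WSA-mode rule that indentation is 
ignored inside an open bracket):

    import std.typecons : Tuple;

    // Hypothetical: turn Python-style indentation into synthetic
    // "{" / "}" tokens. 'lines' is assumed to be (indent width,
    // tokens on that line) pairs produced by the lexer.
    string[] normalizeIndentation(Tuple!(int, string[])[] lines)
    {
        string[] output;
        int[] levels = [0];             // stack of open indent widths
        foreach (line; lines)
        {
            int indent = line[0];
            if (indent > levels[$ - 1]) // deeper: open a block
            {
                levels ~= indent;
                output ~= "{";
            }
            while (indent < levels[$ - 1]) // shallower: close blocks
            {
                levels = levels[0 .. $ - 1];
                output ~= "}";
            }
            output ~= line[1];          // the line's own tokens
        }
        while (levels.length > 1)       // close blocks left open at EOF
        {
            levels = levels[0 .. $ - 1];
            output ~= "}";
        }
        return output;
    }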

> I wanted to make such a front-end so that I could easily make a 
> C backend.  I believe such a compiler would be able to do that 
> with great ease.
>
> Needing to use D in places where it isn't available is a real 
> pain-point for me right now, and I'll probably find ways to 
> spend time on it eventually.

Yeah, with a tree-transforming parser, I imagine the same thing, 
except my current fetish is to convert a certain subset of D to 
multiple other languages automatically. Then I could write 
libraries that can easily be used by an astonishingly large 
audience. I certainly would like to see D targeting Android, but 
that's best done directly from D to ARM.

Anyway, the devil's in the detail. Originally I wanted to do a 
parser generator and a "completely general AST" in C# and 
couldn't seem to work out the details, but D is more flexible and 
is likely better suited to the task.

