Let's stop parser Hell

Sat Jul 7 13:26:12 PDT 2012

On 07/07/2012 08:55 PM, Chad J wrote:
> On 07/07/2012 02:23 PM, David Piepgrass wrote:
>>> Note that PEG does not impose to use packrat parsing, even though it
>>> was developed to use it. I think it's a historical 'accident' that put
>>> the two together: Bryan Ford thesis used the two together.
>>
>> Interesting. After trying to use ANTLR-C# several years back, I got
>> disillusioned because nobody was interested in fixing the bugs in it
>> (ANTLR's author is a Java guy first and foremost) and the source code of
>> the required libraries didn't have source code or a license (wtf.)
>>
>> So, for awhile I was thinking about how I might make my own parser
>> generator that was "better" than ANTLR. I liked the syntax of PEG
>> descriptions, but I was concerned about the performance hit of packrat
>> and, besides, I already liked the syntax and flexibility of ANTLR. So my
>> idea was to make something that was LL(k) and mixed the syntax of ANTLR
>> and PEG while using more sane (IMO) semantics than ANTLR did at the time
>> (I've no idea if ANTLR 3 still uses the same semantics today...) All of
>> this is 'water under the bridge' now, but I hand-wrote a lexer to help
>> me plan out how my parser-generator would produce code. The output code
>> was to be both more efficient and significantly more readable than
>> ANTLR's output. I didn't get around to writing the parser-generator
>> itself but I'll have a look back at my handmade lexer for inspiration.
>>
>>>> However, as I found a few hours ago, Packrat parsing (typically used to
>>>> handle PEG) has serious disadvantages: it complicates debugging
>>>> because of
>>>> frequent backtracking, it has problems with error recovery, and
>>>> typically
>>>> disallows to add actions with side effects (because of possibility of
>>>> backtracking). These are important enough to reconsider my plans of
>>>> using
>>>> Pegged. I will try to analyze whether the issues are so fundamental
>>>> that I
>>>> (or somebody else) will have to create an ANTLR-like parser instead, or
>>>> whether it is possible to introduce changes into Pegged that would
>>>> fix these
>>>> problems.
>>
>> I don't like the sound of this either. Even if PEGs were fast,
>> difficulty in debugging, error handling, etc. would give me pause. I
>> insist on well-rounded tools. For example, even though LALR(1) may be
>> the fastest type of parser (is it?), I prefer not to use it due to its
>> inflexibility (it just doesn't like some reasonable grammars), and the
>> fact that the generated code is totally unreadable and hard to debug
>> (mind you, when I learned LALR in school I found that it is possible to
>> visualize how it works in a pretty intuitive way--but debuggers won't do
>> that for you.)
>>
>> While PEGs are clearly far more flexible than LALR and probably more
>> flexible than LL(k), I am a big fan of old-fashioned recursive descent
>> because it's very flexible (easy to insert actions during parsing, and
>> it's possible to use custom parsing code in certain places, if
>> necessary*) and the parser generator's output is potentially very
>> straightforward to understand and debug. In my mind, the main reason you
>> want to use a parser generator instead of hand-coding is convenience,
>> e.g. (1) to compress the grammar down so you can see it clearly, (2)
>> have the PG compute the first-sets and follow-sets for you, (3) get
>> reasonably automatic error handling.
>>
>> * (If the language you want to parse is well-designed, you'll probably
>> not need much custom parsing. But it's a nice thing to offer in a
>> general-purpose parser generator.)
>>
>> I'm not totally sure yet how to support good error messages, efficiency
>> and straightforward output at the same time, but by the power of D I'm
>> sure I could think of something...
>>
>> I would like to submit another approach to parsing that I dare say is my
>> favorite, even though I have hardly used it at all yet. ANTLR offers
>> something called "tree parsing" that is extremely cool. It parses trees
>> instead of linear token streams, and produces other trees as output. I
>> don't have a good sense of how tree parsing works, but I think that some
>> kind of tree-based parser generator could become the basis for a very
>> flexible and easy-to-understand D front-end. If a PG operates on trees
>> instead of linear token streams, I have a sneaky suspicion that it could
>> revolutionize how a compiler front-end works.
>>
>> Why? because right now parsers operate just once, on the user's input,
>> and from there you manipulate the AST with "ordinary" code. But if you
>> have a tree parser, you can routinely manipulate and transform parts of
>> the tree with a sequence of independent parsers and grammars. Thus,
>> parsers would replace a lot of things for which you would otherwise use
>> a visitor pattern, or something. I think I'll try to sketch out this
>> idea in more detail later.
>
> I was thinking the same thing.
>
> My intent is to create a kind of regular-expression-of-nodes with
> push/pop operators to recognize ascent and descent on the tree.  Such a
> regular expression would allow one to capture subtrees out of
> generalized patterns and then place them into new trees that then become
> the input for the next pattern or set of patterns.  I think this is much
> closer to how I conceptualize semantic analysis than how semantic
> analysis is done in front ends like DMD: it should to be done with
> pattern recognition and substitution, not with myriad nested
> if-statements and while-loops.
>
> My vision is to have code similar to this in the front-end:
>
> /+
> Lower
>      while ( boolExpr )
>      {
>          statements;
>      }
>
> Into
>
>      loopAgain:
>      if ( !boolExpr )
>          goto exitLoop
>      statements;
>      goto loopAgain
>      exitLoop:
> +/
> void lowerWhileStatement( SyntaxElement* syntaxNode )
> {
>      auto captures = syntaxNode.matchNodes(
>          TOK_WHILE_NODE,
>          OP_ENTER_NODE,
>              OP_CAPTURE(0),
>              OP_BEGIN,
>                  TOK_EXPRESSION,
>              OP_END,
>              OP_CAPTURE(1),
>              OP_BEGIN,
>                  TOK_STATEMENT,
>              OP_END,
>          OP_LEAVE_NODE);
>
>      if ( captures is null )
>          return;
>
>      syntaxNode.replaceWith(
>          LabelNode("loopAgain"),
>          TOK_IF_STATEMENT,
>          OP_INSERT,
>          OP_BEGIN,
>              TOK_NEGATE,
>              OP_INSERT,
>              OP_BEGIN,
>                  captures[0], // Expression
>              OP_END,
>              GotoStatement("exitLoop"),
>          OP_END,
>          captures[1], // statements
>          GotoStatement("loopAgain"),
>          LabelNode("exitLoop")
>          );
> }
>
>
> The specifics will easily change.

I'd suggest:

AstOp!`
Lower
     while ( boolExpr )
     {
         statements;
     }

Into
     loopAgain:
     if ( !boolExpr ) {
         statements;
     } else goto exitLoop
     goto loopAgain
     exitLoop:

`.run(syntaxNode);