Let's stop parser Hell

Sat Jul 7 11:23:43 PDT 2012

> Note that PEG does not impose to use packrat parsing, even 
> though it
> was developed to use it. I think it's a historical 'accident' 
> that put
> the two together: Bryan Ford thesis used the two together.

Interesting. After trying to use ANTLR-C# several years back, I 
got disillusioned because nobody was interested in fixing the 
bugs in it (ANTLR's author is a Java guy first and foremost) and 
the source code of the required libraries didn't have source code 
or a license (wtf.)

So, for awhile I was thinking about how I might make my own 
parser generator that was "better" than ANTLR. I liked the syntax 
of PEG descriptions, but I was concerned about the performance 
hit of packrat and, besides, I already liked the syntax and 
flexibility of ANTLR. So my idea was to make something that was 
LL(k) and mixed the syntax of ANTLR and PEG while using more sane 
(IMO) semantics than ANTLR did at the time (I've no idea if ANTLR 
3 still uses the same semantics today...) All of this is 'water 
under the bridge' now, but I hand-wrote a lexer to help me plan 
out how my parser-generator would produce code. The output code 
was to be both more efficient and significantly more readable 
than ANTLR's output. I didn't get around to writing the 
parser-generator itself but I'll have a look back at my handmade 
lexer for inspiration.

>> However, as I found a few hours ago, Packrat parsing 
>> (typically used to
>> handle PEG) has serious disadvantages: it complicates 
>> debugging because of
>> frequent backtracking, it has problems with error recovery, 
>> and typically
>> disallows to add actions with side effects (because of 
>> possibility of
>> backtracking). These are important enough to reconsider my 
>> plans of using
>> Pegged. I will try to analyze whether the issues are so 
>> fundamental that I
>> (or somebody else) will have to create an ANTLR-like parser 
>> instead, or
>> whether it is possible to introduce changes into Pegged that 
>> would fix these
>> problems.

I don't like the sound of this either. Even if PEGs were fast, 
difficulty in debugging, error handling, etc. would give me 
pause. I insist on well-rounded tools. For example, even though 
LALR(1) may be the fastest type of parser (is it?), I prefer not 
to use it due to its inflexibility (it just doesn't like some 
reasonable grammars), and the fact that the generated code is 
totally unreadable and hard to debug (mind you, when I learned 
LALR in school I found that it is possible to visualize how it 
works in a pretty intuitive way--but debuggers won't do that for 
you.)

While PEGs are clearly far more flexible than LALR and probably 
more flexible than LL(k), I am a big fan of old-fashioned 
recursive descent because it's very flexible (easy to insert 
actions during parsing, and it's possible to use custom parsing 
code in certain places, if necessary*) and the parser generator's 
output is potentially very straightforward to understand and 
debug. In my mind, the main reason you want to use a parser 
generator instead of hand-coding is convenience, e.g. (1) to 
compress the grammar down so you can see it clearly, (2) have the 
PG compute the first-sets and follow-sets for you, (3) get 
reasonably automatic error handling.

* (If the language you want to parse is well-designed, you'll 
probably not need much custom parsing. But it's a nice thing to 
offer in a general-purpose parser generator.)

I'm not totally sure yet how to support good error messages, 
efficiency and straightforward output at the same time, but by 
the power of D I'm sure I could think of something...

I would like to submit another approach to parsing that I dare 
say is my favorite, even though I have hardly used it at all yet. 
ANTLR offers something called "tree parsing" that is extremely 
cool. It parses trees instead of linear token streams, and 
produces other trees as output. I don't have a good sense of how 
tree parsing works, but I think that some kind of tree-based 
parser generator could become the basis for a very flexible and 
easy-to-understand D front-end. If a PG operates on trees instead 
of linear token streams, I have a sneaky suspicion that it could 
revolutionize how a compiler front-end works.

Why? because right now parsers operate just once, on the user's 
input, and from there you manipulate the AST with "ordinary" 
code. But if you have a tree parser, you can routinely manipulate 
and transform parts of the tree with a sequence of independent 
parsers and grammars. Thus, parsers would replace a lot of things 
for which you would otherwise use a visitor pattern, or 
something. I think I'll try to sketch out this idea in more 
detail later.