PEG matching/parsing lib in progress

Mon Nov 8 02:48:20 PST 2010

Hello,

I'm currently writing a PEG matching/parsing library in and for D. This is rather to learn the language better, but it will be published anyway. (Tell me whether there are already, could not find any.)
One key characteristic is: instead of using a meta-grammar (an enhanced dialect of PEG) to write a grammar which then needs to be translated into a parser, one defines the parser directly in D, using Pattern types like Literal, Choice, Option, etc. For instance:
    auto sign = new Option(new Klass("+-"));
    auto digits = new String(new Klass("0-9"));
    auto integer = new Tuple(sign, digits);
    writeln(integer.match("-321"));
    // should write: "integer:(sign:- digits:321)"
In other words: the grammar is the parser. A big advantage is one needs not learn and master a new langage. A big drawback is the grammar is verbose -- but that's ok for me.
Another advantage is one can often express the grammar more elegantly, or solve difficult issues, by using D features inside the grammar:
    lDelim = Character("<"); lDelim = Character(">");
    Node tagSection(name) {return Tuple(lDelim, Literal(name), rDelim);)}
    boldSection = tagSection("b");
(This defines a kind of super-pattern. Also solves the problem that tags must match -- no need to keep a stack of open tags.)

I'm facing a 2 issues, 1 about design, 1 about implementation.

First, you may find it weird that I call "Tuple" what is usually named "Sequence". Actually, a sequence of predefined elements in text, like above, correspond to tuple fields. The result is a composite node, each child of which is a node matched by a sub-pattern. Hope you see what I mean.
"Sequence" would be misleading: in fact, repetitive patterns yield sequences (arrays) of nodes. For instance:
    digits = OneOrMore(digit);
    writeln(digits.match("321"));
    // should write: "digits:[digit:3 digit:2 digit:1]"

For this reason, a TupleNode presently holds an associative array of children nodes, each indexed by the pattern that yielded it. I find this rather sensible, and rich in terms of semantics (but see also below). As a result, when traversing an AST, one can access tuple node fields by key (instead of by index number), which is imo nicer and more expressive. What do you think?

One issue happens when several fields of the tuple actually are of the same kind:
    addition = new Tuple(operand, PLUS, operand);
In fact, the pattern operand here determines the format of each operand without telling apart their distinct *roles*. This issue breaks the above dsign, since both operands would go to the same entry of the associative array. The user must thus write eg:
    rightOperand = leftOperand.copy();
    addition = new Tuple(leftOperand, PLUS, rightOperand);

So, I need to make a choice:
    1. Keep this design, which forces users to copy patterns in such cases.
    2. Let it down, make nodes generated by Tuple patterns hold a plain array of chil nodes.
    3. Have tuple nodes hold both an plain array and an associative array.
    4. ???
I'm currently for option 1., because it make the grammar clearer, even if it requires one more line from the user.

Second, I would like node output (toString) to show pattern *names*, like in the examples.
Do you find this useful? In my practice, it seems a priceles feature, even if it makes output more verbose. It especially helps a lot in setting up correct grammars, commonly a difficult task. Unnamed results miss semantic information. What do you think?
Nodes presently hold a ref to their pattern, right. But I cannot find a way for the patterns to know their own names ;-) I do not want the user to have to name them manually, since they already give the needed information in definitions. But the said info seems inaccessible at runtime.

One apparent solution would be write the grammar itself in an associative array of (name:pattern) pairs, then traverse it to set the names on the patterns themselves. But this does not work because hogher-level patterns need to reference lower-level ones.
Another approach would be to explore the current scope (symbol table). That's what I do win dynamic languages. Is this possible in D?
Can you imagine another solution? (Note: .stringof is not a solution, I guess, since this.stringof yields "this"!)

This feature is also highly useful when writing out patterns themselves: subpatterns are called by their names. For instance, integer shows as "(sign digits)". The only alternative is to fully expand subpatterns, and subsubpattern.., which quickly leads to unreadable mess like regexes. The same advantage helps in having clear match failure error messages: the failing pattern shows in a readible manner.
It would also give an alternative for tuple nodes: keys could be pattern names instead of patterns themselves; which is more flexible since a tree can then be traversed by a routine lying in a scope where pattern definitions are not available (eg the parser can be written in a module).

Thank you,
Denis
-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com