D parsing

Chad Joan chadjoan at gmail.com
Mon Nov 4 22:36:46 PST 2013


On Friday, 1 November 2013 at 21:22:46 UTC, Andrei Alexandrescu 
wrote:
> ...
> Bugs stopping Pegged from going forward should receive high 
> priority. I also encourage others to participate to that and 
> similar work.
>
>
> Andrei

Ack!  I share Philippe's sense of timing.  This would have been 
even nicer to hear a year ago when both of us were actively 
committing ;)

I was going to close down a project or two that I care deeply 
about and had started a long time ago, but now I see that this 
might be a hard decision.

Nonetheless, I am really glad that you are showing this interest! 
I like to hear stuff like this, since I too really like Pegged.


Andrei and Philippe,

I feel compelled to share some of my thoughts that I never had 
time to finish back then.

I was working on a parser-engine-as-a-library that could be used 
as optimized internals for Pegged, as well as by any other tools 
that need to recognize these common patterns.

The idea was to expose an API like this:

     string makeParser()
     {
         auto builder = new ParserBuilder!char;
         builder.pushSequence();        // 'x' followed by...
             builder.literal('x');
             builder.pushMaybe();       // ...an optional...
                 builder.literal('y');  // ...'y'.
             builder.pop();
         builder.pop();
         // Emit D source for a parser named "callMe".
         return builder.toDCode("callMe");
     }

     const foo = makeParser();

     pragma(msg, foo);  // Print the generated code at compile time.

     mixin(foo);        // Compile the generated parser into this module.

That snippet would create a parser that recognizes the grammar 
'x' ( 'y'? ).
The current fledgling implementation creates this parser:
http://pastebin.com/MgSqWXE2

Of course, no one would be expected to write grammars like that.  
It would be the job of tools like Pegged or std.regex to package 
it up in nice syntax that is easy to use.
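To give a rough idea of what that layering could look like: here is a hypothetical sketch of a tiny front-end that translates a minimal pattern language (just literals and a postfix '?') into builder calls. The real ParserBuilder API is stubbed out with a recorder so the example is self-contained; all the names here are made up for illustration.

```d
// Stand-in for ParserBuilder that just records which calls were made.
struct RecordingBuilder
{
    string[] calls;
    void pushSequence() { calls ~= "pushSequence"; }
    void literal(char c) { calls ~= "literal('" ~ c ~ "')"; }
    void pushMaybe()    { calls ~= "pushMaybe"; }
    void pop()          { calls ~= "pop"; }
}

// Translate a pattern like "xy?" into the builder-call sequence a
// tool such as Pegged might emit.  Only literals and '?' are handled.
void compilePattern(ref RecordingBuilder b, string pattern)
{
    b.pushSequence();
    for (size_t i = 0; i < pattern.length; i++)
    {
        if (i + 1 < pattern.length && pattern[i + 1] == '?')
        {
            b.pushMaybe();
            b.literal(pattern[i]);
            b.pop();
            i++;  // Skip over the '?'.
        }
        else
            b.literal(pattern[i]);
    }
    b.pop();
}
```

Compiling "xy?" through this sketch reproduces exactly the call sequence from the snippet above.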

The code already takes a somewhat different approach than 
Pegged's original strategy.  Rather than generating a bunch of 
templates that dmd then has to instantiate to actualize the 
parser, it just emits a bunch of very primitive procedural D 
code.  I suspect that this approach would mix in far faster with 
the current dmd, because the deeply nested templates generated by 
Pegged seemed to be a bottleneck.  I have to hand it to Philippe, 
though, for coming up with a very clever way to bootstrap the 
thing: once I saw how his templates assembled together, I 
realized just how convenient that was!

(And I think my parser generator still has to be taught how to 
avoid making redundant branches in its output: there's some hash 
table action that belongs in there somewhere.)
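The "hash table action" could plausibly be hash consing: key each subtree by a canonical string and reuse any structurally identical subtree already seen, so each distinct branch is emitted only once. This is an illustrative sketch, not code from parser_builder; every name in it is made up.

```d
// A made-up expression-tree node for the sketch.
class Node
{
    string kind;      // e.g. "literal", "seq", "maybe"
    string value;     // payload, e.g. the literal's character
    Node[] children;

    this(string kind, string value, Node[] children = null)
    {
        this.kind = kind;
        this.value = value;
        this.children = children;
    }

    // Canonical key: two structurally identical subtrees produce the
    // same string, so they land in the same table slot.
    string key() const
    {
        string s = kind ~ ":" ~ value;
        foreach (c; children)
            s ~= "(" ~ c.key() ~ ")";
        return s;
    }
}

Node[string] internTable;

// Return a previously seen identical subtree if there is one, so the
// code generator never emits the same branch twice.
Node intern(Node n)
{
    auto k = n.key();
    if (auto existing = k in internTable)
        return *existing;
    internTable[k] = n;
    return n;
}
```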

The small amount of code that I have for it is here:
https://github.com/chadjoan/xdc/blob/master/src/xdc/parser_builder/parser_builder.d

I wanted to eventually make it generic enough to recognize 
patterns in things besides strings.  Being able to write grammars 
that recognize patterns in ASTs is /useful/.  That leads into the 
whole xdc project: automate all of the tedious crud in semantic 
analysis, and thus make compiler writing, and possibly other AST 
manipulations in user code, much easier.

The other thing I wanted to do was to optimize it.
- I intended it to do no allocations unless the caller asks for 
them.
- I intended to do a bunch of PEG/regex hybridization.

I noticed some mathematical properties of PEGs and regular 
expressions that should allow you to mix the two as much as 
possible.  All you have to do is tell it how to behave at the 
boundaries where they meet.  And given that PEGs already define 
their own behavior pretty well, it would become possible to lower 
a lot of a PEG into regular expressions connected with a minimal 
set of PEG rules.  This would be some awesome lowering: if you 
first do a pass that inlines as many rules as possible, and then 
do a second pass that converts PEG elements into regular elements 
whenever possible, then I feel like the thing will be damned near 
optimal.  If you are wondering what these mathematical properties 
are, then I encourage you to look at this snippet where I define 
"unitary" and "non-unitary" expressions, for lack of prior terms:
http://pastebin.com/iYBypRc5
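To make the lowering concrete, here is one hand-worked example of my own (not taken from the pastebin). Consider the PEG

    Number <- [0-9]+
    Expr   <- Number ('+' Number)*

There is no recursion and no choice that depends on PEG's ordered semantics, so after a pass that inlines Number, the whole of Expr collapses into a single regular expression:

```d
import std.regex : matchFirst, regex;

// The lowered form of Expr: everything became one regular element.
bool matchesExpr(string s)
{
    auto re = regex(`^[0-9]+(\+[0-9]+)*$`);
    return !matchFirst(s, re).empty;
}
```

A grammar with real recursion (nested parentheses, say) could not be fully lowered like this; the point is that the regular fragments around the recursive core can be.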

Another fun thought: PEGs can have look-behind that includes any 
regular elements without any additional algorithmic complexity.  
Just take all of the look-behinds in the grammar, mash them 
together into one big regular expression using regular 
alternation (|), and then have the resulting automaton consume 
input in lock-step with the PEG parser.  Whenever the PEG parser 
needs to do a look-behind, it just checks whether the companion 
automaton is in a matching state for the capture it needs.
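A hedged sketch of that lock-step companion: a real implementation would advance a DFA by one state transition per consumed character; to stay self-contained, this sketch simulates the automaton by re-running an end-anchored regex over the consumed input, which demonstrates the semantics but not the O(1)-per-character cost. All names are illustrative.

```d
import std.regex : Regex, matchFirst, regex;

struct LookbehindCompanion
{
    Regex!char combined;  // all look-behinds mashed into one alternation
    string consumed;

    this(string[] lookbehinds)
    {
        import std.array : join;
        // End-anchored: "is some look-behind matching right now?"
        combined = regex("(?:" ~ lookbehinds.join("|") ~ ")$");
    }

    // Fed in lock-step as the PEG parser consumes each character.
    void feed(char c) { consumed ~= c; }

    // Queried by the PEG parser wherever a look-behind appears.
    bool inMatchingState()
    {
        return !matchFirst(consumed, combined).empty;
    }
}
```

So a grammar with look-behinds "ab" and "cd" gets one companion built from `(?:ab|cd)$`, and the PEG parser just polls it at the relevant positions.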

*sigh*, I feel like I could write a paper on this stuff if I were 
in grad school right now.  Alas, I am stuck doing 50-60 hours a 
week of soul-sucking business programming.  Well, then again, my 
understanding is that even though I can think of things that seem 
like they would make interesting topics for publishable papers, 
reality would have the profs conscript me to do completely 
different things that are possibly just as inane as the business 
programming.

I worry that the greater threat to good AST manipulation tools in 
D is lack of free time, rather than the DMD bugs.

I hope this is useful to someone!


More information about the Digitalmars-d mailing list