D parsing
Chad Joan
chadjoan at gmail.com
Mon Nov 4 22:36:46 PST 2013
On Friday, 1 November 2013 at 21:22:46 UTC, Andrei Alexandrescu
wrote:
> ...
> Bugs stopping Pegged from going forward should receive high
> priority. I also encourage others to participate to that and
> similar work.
>
>
> Andrei
Ack! I share Philippe's sense of timing. This would have been
even nicer to hear a year ago when both of us were actively
committing ;)
I was going to close a project or two that I care deeply about
and had started a long time ago, but I see that this might be a
hard decision.
Nonetheless, I am really glad that you are showing this interest!
I like to hear stuff like this, since I too really like Pegged.
Andrei and Philippe,
I feel compelled to share some of my thoughts that I never had
time to finish back then.
I was working on a parser-engine-as-a-library that could be used
as optimized internals for Pegged, as well as by any other tools
that need to recognize these common patterns.
The idea was to expose an API like this:
string makeParser()
{
    auto builder = new ParserBuilder!char;
    builder.pushSequence();
        builder.literal('x');
        builder.pushMaybe();
            builder.literal('y');
        builder.pop();
    builder.pop();
    return builder.toDCode("callMe");
}

const foo = makeParser();
pragma(msg, foo);
mixin(foo);
That snippet would create a parser that recognizes the grammar
'x' ( 'y'? ).
The current fledgling implementation creates this parser:
http://pastebin.com/MgSqWXE2
Of course, no one would be expected to write grammars like that.
It would be the job of tools like Pegged or std.regex to package
it up in nice syntax that is easy to use.
The code already takes a somewhat different strategy than
Pegged's original one. Rather than generating a bunch of
templates that dmd then has to instantiate to actualize the
parser, it just emits a bunch of very primitive procedural D
code. I suspect that this approach would mix in far faster with
the current dmd, because the deeply nested templates generated by
Pegged seemed to be a bottleneck. I have to hand it to Philippe,
though, for coming up with a very clever way to bootstrap the
thing: once I saw how his templates assembled together, I
realized just how convenient that was!
(And I think my parser generator still has to be taught how to
avoid emitting redundant branches in its output: there's some
hash table action that belongs in there somewhere.)
The small amount of code that I have for it is here:
https://github.com/chadjoan/xdc/blob/master/src/xdc/parser_builder/parser_builder.d
I wanted to eventually make it generic enough to recognize
patterns in things besides strings. Being able to write grammars
that recognize patterns in ASTs is /useful/. That leads into the
whole xdc project: automate all of the tedious crud in semantic
analysis, and thus make compiler writing, and possibly other AST
manipulations in user code, much easier.
The other thing I wanted to do was to optimize it.
- I intended it to do no allocations unless the caller asks for
them.
- I intended to do a bunch of PEG/regex hybridization.
I noticed some mathematical properties of PEGs and regular
expressions that should allow you to mix the two as much as
possible. All you have to do is tell the engine how to behave at
the boundaries where they meet. And given that PEGs already
define their own behavior pretty well, it would become possible
to lower much of a PEG into regular expressions connected by a
minimal set of PEG rules. This would be some awesome lowering: if
you first do a pass that inlines as many rules as possible, and
then do a second pass that converts PEG elements into regular
elements whenever possible, then I feel like the thing would be
damned near optimal. If you are wondering what these mathematical
properties are, then I encourage you to look at this snippet
where I define "unitary" and "non-unitary" expressions, for lack
of prior terms:
http://pastebin.com/iYBypRc5
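To make the lowering idea concrete, here is a minimal sketch in Python rather than D, with made-up rule names. The point is only the shape of the hybrid: sub-grammars that are purely regular (character classes, repetition, alternation without recursion) lower to compiled regexes, while an inherently non-regular rule (balanced parentheses) stays as recursive-descent PEG code, and the two meet at explicit boundaries.

```python
import re

# "Ident <- [a-z]+" and "Number <- [0-9]+" are purely regular,
# so each lowers to a single compiled regex.
IDENT = re.compile(r"[a-z]+")
NUMBER = re.compile(r"[0-9]+")

def parse_atom(s, pos):
    """Atom <- Ident / Number / '(' Atom ')'  -- mixes regex and PEG.

    Returns the end offset of the match, or None on failure
    (PEG-style ordered choice: try alternatives in order)."""
    m = IDENT.match(s, pos) or NUMBER.match(s, pos)
    if m:
        return m.end()                  # the regular part, handled by a regex
    if pos < len(s) and s[pos] == "(":  # recursion is not regular,
        end = parse_atom(s, pos + 1)    # so this stays a PEG rule
        if end is not None and end < len(s) and s[end] == ")":
            return end + 1
    return None

# At the boundary, the regex behaves like PEG repetition: it matches
# greedily and the parser never backtracks into the regular sub-match.
```

A real lowering pass would of course discover which rules are regular automatically (the "unitary" analysis above) instead of hand-sorting them as done here.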
Another fun thought: PEGs can have look-behind that includes any
regular elements without any additional algorithmic complexity.
Just take all of the look-behinds in the grammar, mash them
together into one big regular expression using regular
alternation (|), and then have the resulting automaton consume
input in lock-step with the PEG parser. Whenever the PEG parser
needs to do a look-behind, it just checks whether the companion
automaton is in a matching state for the capture it needs.
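The lock-step trick can be sketched in a few lines of Python (again, not D, and with a hand-built toy NFA standing in for a real compiled automaton). The merged look-behind alternation here is ( 'ab' | 'cd' ); re-seeding the start state at every position means an accepting state reads as "some look-behind pattern matches a suffix ending here," which is exactly the flag the PEG parser would test.

```python
# Toy NFA for the merged look-behind alternation ( 'ab' | 'cd' ):
# state 0 is the start; transitions are keyed by (state, char);
# state 2 accepts for both branches.
TRANS = {(0, "a"): {1}, (1, "b"): {2},
         (0, "c"): {3}, (3, "d"): {2}}
ACCEPT = {2}

def lookbehind_states(text):
    """For each position i, yield whether some merged look-behind
    pattern matches a suffix of text ending just after text[i]."""
    states = set()
    for ch in text:
        states.add(0)  # re-seed: a suffix match may begin anywhere
        states = {t for s in states for t in TRANS.get((s, ch), ())}
        yield bool(states & ACCEPT)

# The PEG parser would consume the same stream one character at a time,
# in lock-step, and on a look-behind simply test the flag for its
# current position -- no re-scanning of already-consumed input.
```

Since the companion automaton does O(states) work per input character regardless of how many look-behinds the grammar contains, the overall complexity of the PEG parse is unchanged, which is the claim above.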
*sigh*, I feel like I could write a paper on this stuff if I were
in grad school right now. Alas, I am stuck doing 50-60 hours a
week of soul-sucking business programming. Well, then again, my
understanding is that even though I can think of things that seem
like they would make interesting topics for publishable papers,
reality would have the profs conscript me to do completely
different things that are possibly just as inane as the business
programming.
I worry that the greater threat to good AST-manipulation tools in
D is a lack of free time, not so much the DMD bugs.
I hope this is useful to someone!