Let's stop parser Hell
Chad J
chadjoan at __spam.is.bad__gmail.com
Sat Jul 7 15:03:15 PDT 2012
On 07/07/2012 03:13 PM, Roman D. Boiko wrote:
> On Saturday, 7 July 2012 at 18:55:57 UTC, Chad J wrote:
>
>> In this vision I do not use classes and inheritance for my AST.
>> Instead I use structs that contain some kind of nodeType member that
>> would be one of the tokens/symbols in the grammar, like TOK_WHILE_NODE
>> in the above code. Dynamic dispatch is instead performed by (very
>> fast) DFAs recognizing parts of the AST.
>
> Exactly. This idea first came to me in April after I implemented the
> first top-down recursive descent custom parser for a D subset. I tried
> Visitor pattern before that and wasn't happy. There are some subtle
> difficulties which I believe will be possible to overcome, most
> important being the need to introduce a mechanism for hierarchical
> classification (like a pow expression being an assign expression at the
> same time).
>
From some of my earlier scribblings:
enum : SyntaxElement
{
AST_EXPRESSION = 0x0001_0000_0000_0000,
AST_UNARY_EXPR = 0x0000_0001_0000_0000 | AST_EXPRESSION,
AST_NEGATE_EXPR = 0x0000_0000_0001_0000 | AST_UNARY_EXPR,
AST_COMPLIMENT_EXPR = 0x0000_0000_0002_0000 | AST_UNARY_EXPR,
AST_POST_ADD_EXPR = 0x0000_0000_0003_0000 | AST_UNARY_EXPR,
AST_POST_SUB_EXPR = 0x0000_0000_0004_0000 | AST_UNARY_EXPR,
AST_PRE_ADD_EXPR = 0x0000_0000_0005_0000 | AST_UNARY_EXPR,
AST_PRE_SUB_EXPR = 0x0000_0000_0006_0000 | AST_UNARY_EXPR,
AST_BINARY_EXPR = 0x0000_0002_0000_0000 | AST_EXPRESSION,
AST_AND_EXPR = 0x0000_0000_0001_0000 | AST_BINARY_EXPR,
AST_OR_EXPR = 0x0000_0000_0002_0000 | AST_BINARY_EXPR,
AST_XOR_EXPR = 0x0000_0000_0003_0000 | AST_BINARY_EXPR,
AST_AND_AND_EXPR = 0x0000_0000_0004_0000 | AST_BINARY_EXPR,
AST_OR_OR_EXPR = 0x0000_0000_0005_0000 | AST_BINARY_EXPR,
AST_ADD_EXPR = 0x0000_0000_0006_0000 | AST_BINARY_EXPR,
AST_TRINARY_EXPR = 0x0000_0003_0000_0000 | AST_EXPRESSION,
AST_NARY_EXPR = 0x0000_0004_0000_0000 | AST_EXPRESSION,
AST_STATEMENT = 0x0002_0000_0000_0000,
}
bool isA( SyntaxElement leafier, SyntaxElement rootier )
{
SyntaxElement mask = 0;
if ( rootier & 0x0000_0000_FFFF_FFFF )
{
if ( rootier & 0x0000_0000_0000_FFFF )
mask = 0xFFFF_FFFF_FFFF_FFFF;
else
mask = 0xFFFF_FFFF_FFFF_0000;
}
else
{
if ( rootier & 0x0000_FFFF_FFFF_FFFF )
mask = 0xFFFF_FFFF_0000_0000;
else
mask = 0xFFFF_0000_0000_0000;
}
return (leafier & mask) == rootier;
}
unittest
{
assert( isA( AST_EXPRESSION, AST_EXPRESSION) );
assert( isA( AST_NEGATE_EXPR, AST_NEGATE_EXPR) );
assert( isA( AST_NEGATE_EXPR, AST_EXPRESSION) );
assert( isA( AST_NEGATE_EXPR, AST_UNARY_EXPR) );
assert( !isA( AST_EXPRESSION, AST_STATEMENT) );
assert( !isA( AST_NEGATE_EXPR, AST_BINARY_EXPR) );
assert( !isA( AST_NEGATE_EXPR, AST_STATEMENT) );
assert( !isA( AST_NEGATE_EXPR, AST_COMPLIMENT_EXPR) );
assert(false);
}
This approach of course has shameful nesting limitations, but I feel
like these determinations could be fairly well optimized even for the
general case. For example: another approach that I might be more
inclined to take is to give each token/symbol a low-valued index into a
small inheritance table.
I would expect the regex engine to call the isA function as one of it's
operations. Thus placing an AST_EXPRESSION into your expression would
also match an AST_NEGATE_EXPR too.
More information about the Digitalmars-d
mailing list