Let's stop parser Hell

Sat Jul 7 15:03:15 PDT 2012

On 07/07/2012 03:13 PM, Roman D. Boiko wrote:
> On Saturday, 7 July 2012 at 18:55:57 UTC, Chad J wrote:
>
>> In this vision I do not use classes and inheritance for my AST.
>> Instead I use structs that contain some kind of nodeType member that
>> would be one of the tokens/symbols in the grammar, like TOK_WHILE_NODE
>> in the above code. Dynamic dispatch is instead performed by (very
>> fast) DFAs recognizing parts of the AST.
>
> Exactly. This idea first came to me in April after I implemented the
> first top-down recursive descent custom parser for a D subset. I tried
> Visitor pattern before that and wasn't happy. There are some subtle
> difficulties which I believe will be possible to overcome, most
> important being the need to introduce a mechanism for hierarchical
> classification (like a pow expression being an assign expression at the
> same time).
>

 From some of my earlier scribblings:

enum : SyntaxElement
{
   AST_EXPRESSION          = 0x0001_0000_0000_0000,
     AST_UNARY_EXPR        = 0x0000_0001_0000_0000 | AST_EXPRESSION,
       AST_NEGATE_EXPR     = 0x0000_0000_0001_0000 |  AST_UNARY_EXPR,
       AST_COMPLIMENT_EXPR = 0x0000_0000_0002_0000 |  AST_UNARY_EXPR,
       AST_POST_ADD_EXPR   = 0x0000_0000_0003_0000 |  AST_UNARY_EXPR,
       AST_POST_SUB_EXPR   = 0x0000_0000_0004_0000 |  AST_UNARY_EXPR,
       AST_PRE_ADD_EXPR    = 0x0000_0000_0005_0000 |  AST_UNARY_EXPR,
       AST_PRE_SUB_EXPR    = 0x0000_0000_0006_0000 |  AST_UNARY_EXPR,
     AST_BINARY_EXPR       = 0x0000_0002_0000_0000 | AST_EXPRESSION,
       AST_AND_EXPR        = 0x0000_0000_0001_0000 |  AST_BINARY_EXPR,
       AST_OR_EXPR         = 0x0000_0000_0002_0000 |  AST_BINARY_EXPR,
       AST_XOR_EXPR        = 0x0000_0000_0003_0000 |  AST_BINARY_EXPR,
       AST_AND_AND_EXPR    = 0x0000_0000_0004_0000 |  AST_BINARY_EXPR,
       AST_OR_OR_EXPR      = 0x0000_0000_0005_0000 |  AST_BINARY_EXPR,
       AST_ADD_EXPR        = 0x0000_0000_0006_0000 |  AST_BINARY_EXPR,
     AST_TRINARY_EXPR      = 0x0000_0003_0000_0000 | AST_EXPRESSION,
     AST_NARY_EXPR         = 0x0000_0004_0000_0000 | AST_EXPRESSION,
   AST_STATEMENT           = 0x0002_0000_0000_0000,
}

bool isA( SyntaxElement leafier, SyntaxElement rootier )
{
	SyntaxElement mask = 0;

	if ( rootier & 0x0000_0000_FFFF_FFFF )
	{
		if ( rootier & 0x0000_0000_0000_FFFF )
			mask = 0xFFFF_FFFF_FFFF_FFFF;
		else
			mask = 0xFFFF_FFFF_FFFF_0000;
	}
	else
	{
		if ( rootier & 0x0000_FFFF_FFFF_FFFF )
			mask = 0xFFFF_FFFF_0000_0000;
		else
			mask = 0xFFFF_0000_0000_0000;
	}

	return (leafier & mask) == rootier;
}

unittest
{
	assert(  isA( AST_EXPRESSION,  AST_EXPRESSION) );
	assert(  isA( AST_NEGATE_EXPR, AST_NEGATE_EXPR) );
	assert(  isA( AST_NEGATE_EXPR, AST_EXPRESSION) );
	assert(  isA( AST_NEGATE_EXPR, AST_UNARY_EXPR) );
	assert( !isA( AST_EXPRESSION,  AST_STATEMENT) );
	assert( !isA( AST_NEGATE_EXPR, AST_BINARY_EXPR) );
	assert( !isA( AST_NEGATE_EXPR, AST_STATEMENT) );
	assert( !isA( AST_NEGATE_EXPR, AST_COMPLIMENT_EXPR) );
	assert(false);
}

This approach of course has shameful nesting limitations, but I feel 
like these determinations could be fairly well optimized even for the 
general case.  For example: another approach that I might be more 
inclined to take is to give each token/symbol a low-valued index into a 
small inheritance table.

I would expect the regex engine to call the isA function as one of it's 
operations.  Thus placing an AST_EXPRESSION into your expression would 
also match an AST_NEGATE_EXPR too.