Let's stop parser Hell

Sat Jul 7 15:44:57 PDT 2012

On 07/07/2012 06:18 PM, Roman D. Boiko wrote:
> On Saturday, 7 July 2012 at 22:03:20 UTC, Chad J wrote:
>> enum : SyntaxElement
>> {
>> AST_EXPRESSION = 0x0001_0000_0000_0000,
>> AST_UNARY_EXPR = 0x0000_0001_0000_0000 |
>
> This would cause wasting space (probably a lot). Definitely it would not
> be easy to implement in a parser generator, when various language
> properties are not known beforehand for fine-grained tuning.
>
>> This approach of course has shameful nesting limitations, but I feel
>> like these determinations could be fairly well optimized even for the
>> general case. For example: another approach that I might be more
>> inclined to take is to give each token/symbol a low-valued index into
>> a small inheritance table.
>
> Depending on implementation, that might introduce the multiplier
> overhead of table access per each comparison (and there would be many in
> case of searching for nodes of specific type).
>
>> I would expect the regex engine to call the isA function as one of
>> it's operations. Thus placing an AST_EXPRESSION into your expression
>> would also match an AST_NEGATE_EXPR too.
>
> But actually it is not so difficult to implement in a very similar way
> to what you described. I was thinking about a lookup table, but
> different from a traditional inheritance table. It would be indexed by
> AST node type (integral enum value), and store various classification
> information as bits. Maybe this is what you meant and I misunderstood
> you... Example is here:
> https://github.com/roman-d-boiko/dct/blob/May2012drafts/fe/core.d
> (sorry, it doesn't show how to do classification, and has a different
> context, but I hope you get the idea). The advantage over storing
> hierarchical information directly in each token is obviously memory usage.

I see what you mean.

I'm not sure that I buy that language properties are known before-hand. 
  The front-end knows what the language grammar looks like and it knows 
what kind of things it can find in an AST.

Thus you can make the regex DSL do this transformation and let the DFAs 
handle everything:

expr -> (expr | unary_expr | negate_expr | binary_compliment | incr_expr 
| decr_expr | binary_expr | assign_expr | add_assign_expr | nary_expr | ...)

Because what we really want it to do is match any of those expression 
kinds when we place the expression symbol in our regex.

I think the important point here though is that inheritance can be 
reduced to bit-twiddling optimizations in this case.