Writing a JFlex lexer for D - have an issue with cycles

Sun Jan 22 16:46:30 PST 2017

On Sunday, 22 January 2017 at 22:11:08 UTC, FatalCatharsis wrote:
> I'm writing a flex lexer for D and I've hit a roadblock. It is 
> almost working EXCEPT for one specific production.
>
> StringLiteral is cyclic and I don't know how to approach it. It 
> is cyclic because:
>
>      Token -> StringLiteral -> TokenString -> Token
>
> To break the cycle, I was thinking I could just make a 
> production which is Token sans StringLiteral and instead subbed 
> with a production for StringLiteral that does not contain 
> TokenString, but that fundamentally changes the language. 
> Should the lexer really handle something like:
>
>     q{blah1q{20q{"meh"q{20.1q{blah}}}}}
>
> Lexically I don't know how this makes sense. To be clear, I'm 
> wondering if this is acceptable:
>
>     Token:
>         Identifier
>         StringLiteral
>         CharacterLiteral
>         IntegerLiteral
>         FloatLiteral
>         Keyword
>         Operator
>
>      StringLiteral:
>         WysiwygString
>         AlternateWysiwygString
>         DoubleQuotedString
>         HexString
>         DelimitedString
>         TokenString
>
>      TokenString:
>         q{ TokenNonNestedTokenStrings }
>
>
>      TokenNonNestedTokenStrings:
>         TokenNonNestedTokenString
>         TokenNonNestedTokenString TokenNonNestedTokenStrings
>
>      TokenNonNestedTokenString:
>         Identifier
>         StringLiteralNonNestedTokenString
>         CharacterLiteral
>         IntegerLiteral
>         FloatLiteral
>         Keyword
>         Operator
>
>      StringLiteralNonNestedTokenString:
>         WysiwygString
>         AlternateWysiwygString
>         DoubleQuotedString
>         HexString
>         DelimitedString
>
> Which basically disables nested token strings. Has anyone else 
> run into this issue?

This is not really any different than any nesting problem.

First, you must realize that it is more like a "spiral" than a 
circle. It is recursive in this sense:

       Token_n -> StringLiteral_n -> TokenString_n -> Token_(n-1)

and eventually Token_(n-1) must terminate the chain(essentially 
the rule must fail for some n).

But this is ok because parsers are recursive in nature, so you 
just have to have your rule be able to terminate in a logical way.

What we can say about q{} string literal is that either contains 
other string literals or it doesn't.

If it doesn't, that is all that is required because a nested set 
of string literals will have to contain at least one that is not 
nested. That collapses the whole chain.

So, the way I see it is that you really need your TokenString to 
have a nested and non-nested version. The nested version will 
cycle but also as an out.

e.g.,
>     Token:
>         Identifier
>         StringLiteral
>         CharacterLiteral
>         IntegerLiteral
>         FloatLiteral
>         Keyword
>         Operator
>
>      StringLiteral:
>         WysiwygString
>         AlternateWysiwygString
>         DoubleQuotedString
>         HexString
>         DelimitedString
>         TokenString
>
>      TokenString:
>         q{StringLiteralNonNestedTokenString}  <- Terminal
           q{NestedTokenString}

>
>      NestedTokenString:
           Tokens
           TokenString
           Tokens
>
>      StringLiteralNonNestedTokenString:
>         WysiwygString
>         AlternateWysiwygString
>         DoubleQuotedString
>         HexString
>         DelimitedString
           !q{
           !}

So, a TokenString can be a a "simple string" that doesn't nest or 
it can have nesting. That nesting can go on for every if the 
source has it, and this grammar can handle it.

But If a tokenstring, regardless if it is inside another token 
string, terminates/is not nesting another token string, then you 
are fine too because that catches the eventual termination and 
will back propagate through the nesting to determine the other 
tokens.

The real issue is ambiguity. Any time you have a cycle you must 
be able to get out of it and so your rules must be organized so 
that one always checks to see if termination has occurred before 
checking for nesting. If you allow, say } as an element of a 
stringliteral then it will be ambiguous as the grammar will try 
to match it, when it is meant to be a literal.

I'm not 100% sure about your grammar expression but basically

q{blah1 <- q{NestedTokenString}/nonterminal
        q{20 <- q{NestedTokenString}/nonterminal
            q{"meh" <- q{NestedTokenString}/nonterminal
                   q{20.1 <- q{NestedTokenString}/nonterminal
                         q{blah} <- 
q{StringLiteralNonNestedTokenString}/terminal
                     }
              }
         }
  }

Basically as we parse the string, we encounter q{'s and this 
cause a "cycle"(but it really is a spiral ;).

When we get to the final one we see that it is a terminal because 
it doesn't contain any q{ inside.

Hence at that point the grammar unwinds and builds the tree quite 
easily.

Again, it is really no different than ()'s and such.

One just has to make sure that, say, ( and ) not sued for 
anything else in an ambiguous way.

e.g., (3+)) means what? (suppose ) also was the same as 4. so 
someone things (3+4) = 7 but the compiler cannot distinguish 
between them.

(it could be made to in some ways but the grammar is then not 
context free)

You shouldn't have to worry too much about those cases but you do 
have to make sure your grammar understands that the tokens it 
uses to determine a rule cannot be interpreted in any other way 
along that parsed branch. Else you end up with an ambiguous or 
context sensitive grammar.