Writing a JFlex lexer for D - have an issue with cycles
Profile Anaysis via Digitalmars-d
digitalmars-d at puremagic.com
Sun Jan 22 16:46:30 PST 2017
On Sunday, 22 January 2017 at 22:11:08 UTC, FatalCatharsis wrote:
> I'm writing a flex lexer for D and I've hit a roadblock. It is
> almost working EXCEPT for one specific production.
>
> StringLiteral is cyclic and I don't know how to approach it. It
> is cyclic because:
>
> Token -> StringLiteral -> TokenString -> Token
>
> To break the cycle, I was thinking I could just make a
> production which is Token sans StringLiteral and instead subbed
> with a production for StringLiteral that does not contain
> TokenString, but that fundamentally changes the language.
> Should the lexer really handle something like:
>
> q{blah1q{20q{"meh"q{20.1q{blah}}}}}
>
> Lexically I don't know how this makes sense. To be clear, I'm
> wondering if this is acceptable:
>
> Token:
> Identifier
> StringLiteral
> CharacterLiteral
> IntegerLiteral
> FloatLiteral
> Keyword
> Operator
>
> StringLiteral:
> WysiwygString
> AlternateWysiwygString
> DoubleQuotedString
> HexString
> DelimitedString
> TokenString
>
> TokenString:
> q{ TokenNonNestedTokenStrings }
>
>
> TokenNonNestedTokenStrings:
> TokenNonNestedTokenString
> TokenNonNestedTokenString TokenNonNestedTokenStrings
>
> TokenNonNestedTokenString:
> Identifier
> StringLiteralNonNestedTokenString
> CharacterLiteral
> IntegerLiteral
> FloatLiteral
> Keyword
> Operator
>
> StringLiteralNonNestedTokenString:
> WysiwygString
> AlternateWysiwygString
> DoubleQuotedString
> HexString
> DelimitedString
>
> Which basically disables nested token strings. Has anyone else
> run into this issue?
This is not really any different than any nesting problem.
First, you must realize that it is more like a "spiral" than a
circle. It is recursive in this sense:
Token_n -> StringLiteral_n -> TokenString_n -> Token_(n-1)
and eventually Token_(n-1) must terminate the chain(essentially
the rule must fail for some n).
But this is ok because parsers are recursive in nature, so you
just have to have your rule be able to terminate in a logical way.
What we can say about q{} string literal is that either contains
other string literals or it doesn't.
If it doesn't, that is all that is required because a nested set
of string literals will have to contain at least one that is not
nested. That collapses the whole chain.
So, the way I see it is that you really need your TokenString to
have a nested and non-nested version. The nested version will
cycle but also as an out.
e.g.,
> Token:
> Identifier
> StringLiteral
> CharacterLiteral
> IntegerLiteral
> FloatLiteral
> Keyword
> Operator
>
> StringLiteral:
> WysiwygString
> AlternateWysiwygString
> DoubleQuotedString
> HexString
> DelimitedString
> TokenString
>
> TokenString:
> q{StringLiteralNonNestedTokenString} <- Terminal
q{NestedTokenString}
>
> NestedTokenString:
Tokens
TokenString
Tokens
>
> StringLiteralNonNestedTokenString:
> WysiwygString
> AlternateWysiwygString
> DoubleQuotedString
> HexString
> DelimitedString
!q{
!}
So, a TokenString can be a a "simple string" that doesn't nest or
it can have nesting. That nesting can go on for every if the
source has it, and this grammar can handle it.
But If a tokenstring, regardless if it is inside another token
string, terminates/is not nesting another token string, then you
are fine too because that catches the eventual termination and
will back propagate through the nesting to determine the other
tokens.
The real issue is ambiguity. Any time you have a cycle you must
be able to get out of it and so your rules must be organized so
that one always checks to see if termination has occurred before
checking for nesting. If you allow, say } as an element of a
stringliteral then it will be ambiguous as the grammar will try
to match it, when it is meant to be a literal.
I'm not 100% sure about your grammar expression but basically
q{blah1 <- q{NestedTokenString}/nonterminal
q{20 <- q{NestedTokenString}/nonterminal
q{"meh" <- q{NestedTokenString}/nonterminal
q{20.1 <- q{NestedTokenString}/nonterminal
q{blah} <-
q{StringLiteralNonNestedTokenString}/terminal
}
}
}
}
Basically as we parse the string, we encounter q{'s and this
cause a "cycle"(but it really is a spiral ;).
When we get to the final one we see that it is a terminal because
it doesn't contain any q{ inside.
Hence at that point the grammar unwinds and builds the tree quite
easily.
Again, it is really no different than ()'s and such.
One just has to make sure that, say, ( and ) not sued for
anything else in an ambiguous way.
e.g., (3+)) means what? (suppose ) also was the same as 4. so
someone things (3+4) = 7 but the compiler cannot distinguish
between them.
(it could be made to in some ways but the grammar is then not
context free)
You shouldn't have to worry too much about those cases but you do
have to make sure your grammar understands that the tokens it
uses to determine a rule cannot be interpreted in any other way
along that parsed branch. Else you end up with an ambiguous or
context sensitive grammar.
More information about the Digitalmars-d
mailing list