Writing a JFlex lexer for D - have an issue with cycles

Profile Anaysis via Digitalmars-d digitalmars-d at puremagic.com
Mon Jan 23 15:31:04 PST 2017


On Monday, 23 January 2017 at 01:46:58 UTC, FatalCatharsis wrote:
> On Monday, 23 January 2017 at 00:46:30 UTC, Profile Anaysis 
> wrote:
>> The real issue is ambiguity. Any time you have a cycle you 
>> must be able to get out of it and so your rules must be 
>> organized so that one always checks to see if termination has 
>> occurred before checking for nesting. If you allow, say } as 
>> an element of a stringliteral then it will be ambiguous as the 
>> grammar will try to match it, when it is meant to be a literal.
>
> This brings up another question for me. Isn't the token string 
> production ambiguous by itself?
>
> q{ tokens }
>
> How does it know whether the input "}" is another token or the 
> terminator on a token string?

You know how an if statement in many languages has the feature of 
short-circuit evaluation?

if (X || Y)

X is tested first, and if it is true, there is no need to test Y.

Now let's say that Y is "recursive" in some sense. While X is 
false, Y is evaluated, but once X becomes true there is no need 
to evaluate Y at all.
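In Python, for instance, `or` short-circuits the same way (a minimal sketch; `x` and `y` are just stand-ins for the illustration):

```python
calls = []

def x():
    # The cheap, terminating test.
    calls.append("X")
    return True

def y():
    # Imagine this branch is expensive or recursive.
    calls.append("Y")
    return True

# Python's `or` short-circuits: x() returns True, so y() never runs.
result = x() or y()
```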

This "analogy" is basically what is happening in the grammar for 
nested rules.

As long as we have a terminating rule that is checked first and 
bypasses the recursive rules, we will terminate.

So, to understand this easily, think about it this way for any 
type of nested construct.

1. Create the terminating instance (e.g., the X above), which has 
no recursive or otherwise non-terminating behavior.

2. Create the recursive instance (the Y).

Then effectively do the if (X || Y). This generally maps directly 
into a grammar, because the if statements are implicit in the 
rules (a grammar is a decision tree).

So, to "create X", which is simply a string literal, it has the 
structure q{...}.

This structure must be unambiguous for the ... part. This means 
that ... must not contain q{ or } unless they are escaped in some 
way (escaping essentially turns them into something else; e.g., 
\n is not \ and n but a completely different character as far as 
the grammar is concerned).

Now, that is the terminal rule. It can't recurse, because ... will 
be parsed as "whatever", and that whatever can't, by design (we 
hope), be re-parsed as itself in any way.


Once we have that, we create the non-terminal, recursive rule:

q{...}, in which ... can be itself (or something else which then 
reduces to this, or whatever).

You basically have this structure (you might have to make sure the 
first case actually is a terminal and can't recurse on itself).

Once you have that, it is easy to create a terminating grammar:

Y = q{X | Y}

First X is checked; if it is found, Y terminates; otherwise we 
try Y.

Such a rule as the above can be expanded.

Y = q{X | q{X | q{X | q{X | q{X | q{X | q{X | q{X | q{X | ... Y}

so, this only allows strings like

q{X}
q{q{X}}
q{q{q{X}}}
q{q{q{q{X}}}}
q{q{q{q{q{X}}}}}

etc...

But they always terminate as long as X terminates.

The compiler's logic is this:

Check whether the tokens look like q{ followed by some form of X; 
if there is no X, check whether they match Y, which requires 
reapplying this whole check; then expect a }.
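That logic can be sketched as a tiny recursive-descent matcher for Y = q{X | Y} in Python, assuming for illustration that X is a non-empty run of uppercase letters (deliberately disjoint from the lowercase q of the opener):

```python
def match_y(s, i=0):
    """Return the index just past a Y starting at i, or -1 on failure."""
    if not s.startswith("q{", i):
        return -1
    i += 2
    # Short-circuit order: try the terminal X first...
    j = i
    while j < len(s) and s[j].isupper():
        j += 1
    if j > i:
        i = j                  # X matched: this branch terminates
    else:
        i = match_y(s, i)      # ...and only then the recursive Y
        if i == -1:
            return -1
    if i < len(s) and s[i] == "}":
        return i + 1
    return -1

def is_y(s):
    return match_y(s) == len(s)
```

Because X cannot contain q, { or }, the matcher never has to guess whether a given } closes the literal or belongs inside it.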

Suppose our rule is

Y = q{1X | 2Y}

q{1X}
q{2q{1X}}
q{2q{2q{1X}}}
q{2q{2q{2q{1X}}}}
q{2q{2q{2q{2q{1X}}}}}

would be the valid matches.


If we allow other alternatives like

Y = q{X | (A | B) }
A = l{Y}
B = ($|%)A

then things like

q{X}

q{l{q{X}}}
q{l{q{l{q{X}}}}} (basically A is Y as before but with an l)
q{l{q{l{X}}}} <- not valid, because l{X} is not in the rules 
above; the argument of l{ } must be a Y, and Y always starts with 
q{. The match actually fails at the innermost l{X}, because l{X} 
is not matched by any rule.

q{l{q{$l{q{X}}}}} is valid. This is the rules applied in the form 
Y->A->B->Y.
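A sketch of a matcher for this extended grammar (same assumption as before: X is a non-empty run of uppercase letters, and l, $, % are literal characters):

```python
def parse_y(s, i=0):
    # Y = q{ X | (A | B) }
    if not s.startswith("q{", i):
        return -1
    j = parse_inner(s, i + 2)
    if j == -1 or j >= len(s) or s[j] != "}":
        return -1
    return j + 1

def parse_inner(s, i):
    # Ordered choice: the terminal X is tried first.
    j = i
    while j < len(s) and s[j].isupper():
        j += 1
    if j > i:
        return j                     # X
    if i < len(s) and s[i] in "$%":
        return parse_a(s, i + 1)     # B = ($|%) A
    return parse_a(s, i)             # A

def parse_a(s, i):
    # A = l{Y}
    if not s.startswith("l{", i):
        return -1
    j = parse_y(s, i + 2)
    if j == -1 or j >= len(s) or s[j] != "}":
        return -1
    return j + 1

def matches(s):
    return parse_y(s) == len(s)
```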


You can see, with the proper rules we can generate just about any 
form we want.

The things to keep track of are:

X must never contain tokens that could also match the other 
non-terminals (unless that is the goal), and X must be a 
terminal. In the above example, if X could contain l, {, }, $, 
and/or % then we would have no way of knowing whether our tokens 
are really X or Y. E.g., with q{l{q{l{q{X}}}}}, X could just be 
l{q{l{q{X}}}} and we would terminate immediately. This would make 
parsing the nesting impossible. We can say that X must be 
"disjoint" from Y for the grammar to function as intended.
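The disjointness requirement is easy to see with two regular expressions (a Python sketch; [A-Z]+ stands in for a disjoint X):

```python
import re

# X restricted to uppercase letters: q{q{X}} cannot be read as one
# flat literal, so the nesting has to be parsed.
flat_disjoint = re.fullmatch(r"q\{[A-Z]+\}", "q{q{X}}")

# X allowed to contain q, { and }: the inner "q{X}" also matches as
# X, the whole string reads as a single flat literal, and the
# nesting information is lost.
flat_greedy = re.fullmatch(r"q\{.+\}", "q{q{X}}")
```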

Second, you must list your rules in short-circuit order so that 
the terminal is always checked first. This is because the Y rules 
may actually contain X (they do in fact, and must, if the grammar 
is recursive).

Hence, if we check the Y rules first, we will end up in a cycle 
because we will never be able to determine that the Y rule is 
really an X and X is what allows us to terminate the cycle in the 
first place.

It is not really as difficult as it seems. It is just a way of 
thinking and expressing things that makes the difference. 
Remember that a grammar is effectively just a giant "if statement 
machine", and so similar logic holds. Just as we can do invalid 
things with if statements, we can do similar things with grammars.

The grammars do exactly as we tell them. Maybe we want infinite 
cycles! Maybe we want them to terminate only after 150 loops, e.g.,

Y = Y_(n<150) | X

which could be read as: use the rule Y 150 times and then switch 
to rule X. (This is just a sort of spinwait in the grammar.) 
E.g., if our grammar expressed how enemies think in a video game, 
this could be seen as a sort of pause in their thinking.

It could be used to draw a fractal, where we could have several 
parallel grammars working at the same time and this one burns up 
cycles while the other grammars do things.

It is a contrived example, of course...

For programming languages, we do not want complexities that we 
can't wrap our minds around (since programming languages are the 
basis for building complex programs, such complexity would become 
too difficult to handle in the long run).

The rule for solving these problems is to use the short-circuit 
concept: list your rules in order from least complex to most. The 
least complex, if it is a terminal, will then always allow your 
rules to terminate once it is properly matched.


So your job is to construct your terminating string literal by 
knowing what makes the tokens not terminate... then simply make 
sure they can't be in your terminating string literal.

Second, construct your non-terminating rules to be how you want 
the nesting to be.

This is a general procedure. Any "sequence" of things has these 
properties. A literal in a grammar is simply a terminal rule that 
allows all the other rules to terminate at some point. Without 
them, we can't build a structure that can terminate.

e.g., lexing an integer:

integer = Digit | (Digit (Digit | (Digit ...))) = Digit+

is a recursive rule. It's just that the recursion has zero depth 
in some sense, or that we use meta rules to make the rule 
succinct (e.g., the +). Digit, a terminal, terminates the 
recursion.
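A minimal Python sketch of that recursion, with the terminal Digit alternative ending it:

```python
def match_integer(s, i=0):
    """integer = Digit | Digit integer.

    Both alternatives start with the terminal Digit, so the matcher
    consumes one digit and then greedily tries the recursive
    alternative. Returns the index just past the matched digits,
    or -1 if no Digit is found.
    """
    if i >= len(s) or not s[i].isdigit():
        return -1                   # no Digit: the rule fails here
    j = match_integer(s, i + 1)     # try the recursive alternative
    return j if j != -1 else i + 1  # else settle for the single Digit
```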

We could write it as

integer = Digit | DigitSeq
DigitSeq = Digit | Digit DigitSeq

or

integer = Digit | DigitSeq
DigitSeq = Digit | Digit integer


or

integer = Digit | integer integer

(which, in this case, is simple because any integer can be seen 
as two integers concatenated, except for a single digit)

but, of course

integer = integer | Digit

makes no sense, because with the recursive alternative tried 
first it is effectively an infinite loop (we have to try to match 
some terminal to get somewhere).

Obviously things can get tricky, but programming languages are 
designed so that ambiguity and needless complexity are avoided. 
Programmers are in the business of programming to get a job done; 
if they burn needless cycles (like the spin loop) for no benefit, 
they are less efficient at their job.

Hopefully that helps some.


More information about the Digitalmars-d mailing list