Looking for champion - std.lang.d.lex

Nick Sabalausky a at a.a
Tue Oct 26 21:01:20 PDT 2010


"Walter Bright" <newshound2 at digitalmars.com> wrote in message 
news:ia8321$vug$1 at digitalmars.com...
> Nick Sabalausky wrote:
>> "Walter Bright" <newshound2 at digitalmars.com> wrote in message 
>> news:i9qd8q$1ls4$1 at digitalmars.com...
>>> 4. the tokens should be a value type, not a reference type
>>
>> I'm curious, is your reason for this purely to avoid allocations during 
>> lexing, or are there other reasons too?
>
> It's one big giant reason. Storage allocation gets unbelievably costly in 
> a lexer. Another is it makes tokens easy to copy. Another one is that 
> classes are for polymorphic behavior. What kind of polymorphic behavior 
> would one want with tokens?
>

Honestly, I'm not entirely certain whether or not Goldie actually needs its 
tokens to be classes instead of structs, but I'll explain my current usage:

In the basic "dynamic" style, every token in Goldie, terminal or 
nonterminal, is of type "class Token", no matter which symbol or part of 
the grammar it represents. But Goldie also has an optional "static" style 
that creates a class hierarchy, for the sake of compile-time type safety, 
with "Token" as the base.

Suppose you have a grammar named "simple" that accepts a series of one or 
more a's and b's separated by plus signs:

<Item> ::= 'a' | 'b'
<List> ::= <List> '+' <Item> | <Item>

So there are three terminals: "a", "b", and "+",
and two nonterminals: "<Item>" and "<List>", each with exactly two possible 
reductions.

So in dynamic-style, all of those are "class Token", and that's all that 
exists. But with the optional static-style, the following class hierarchy is 
effectively created:

class Token;
class Token_simple : Token;

class Token_simple!"a" : Token_simple;
class Token_simple!"b" : Token_simple;
class Token_simple!"+" : Token_simple;
class Token_simple!"<Item>" : Token_simple;
class Token_simple!"<List>" : Token_simple;

class Token_simple!("<Item>", "a") : Token_simple!"<Item>";
class Token_simple!("<Item>", "b") : Token_simple!"<Item>";

class Token_simple!("<List>", "<List>", "+", "<Item>") : 
Token_simple!"<List>";
class Token_simple!("<List>", "<Item>") : Token_simple!"<List>";

So rules inherit from the nonterminal they reduce to; terminals and 
nonterminals inherit from a dummy class dedicated specifically to the given 
grammar; and that inherits from plain old dynamic-style "Token". All of 
those template parameters are validated at compile-time.
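
To give an idea of how that kind of validation can work, here's a 
stripped-down sketch of the basic idea (made-up names, not the real Goldie 
code): a CTFE-able helper plus a static assert inside the class template 
is all it takes.

// Hypothetical sketch: reject unknown symbol names at compile time.
// isSimpleSymbol() looks like ordinary run-time code, but D evaluates it
// during compilation (CTFE) because it's used in a static assert.
bool isSimpleSymbol(string name)
{
    foreach(s; ["a", "b", "+", "<Item>", "<List>"])
        if(s == name)
            return true;
    return false;
}

class SimpleToken(string symbol)
{
    static assert(isSimpleSymbol(symbol),
        "'" ~ symbol ~ "' is not a symbol of grammar 'simple'");
    // ...the actual token members would go here...
}

// SimpleToken!"<Item>" compiles; SimpleToken!"<Foo>" is a compile-time error.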

(At some point I'd like to make it possible to specify the rule-based tokens 
as something like:
Token!("<List> ::= <List> '+' <Item>"), but I haven't gotten to it yet.)

There's one more trick: The plain old Token exposes a member "subX" which 
can be numerically indexed to obtain the sub-tokens (for terminals, 
subX.length==0):

void foo(Token token)
{
    // Run-time check: was this token reduced via
    // <List> ::= <List> '+' <Item> ?
    if(token.matches("<List>", "<List>", "+", "<Item>"))
    {
        auto leftSide = token.subX[0];  // the nested <List>
        auto rightSide = token.subX[2]; // the <Item>
        //auto dummy = token.subX[10]; // Run-time error

        // Dynamic-style: everything comes back as plain Token
        static assert(is(typeof(leftSide) == Token));
        static assert(is(typeof(rightSide) == Token));
    }
}

Note that it's impossible for the "matches" function to verify at 
compile-time that its arguments are valid.
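
For comparison, a matches() along these lines (simplified, not the real 
signature, and "name" here is just a stand-in for however a token's symbol 
name is exposed) takes ordinary run-time strings, so a mistyped symbol 
name can only be caught while the program runs:

// Sketch only: run-time structural check of a token against a rule.
// Callable as token.matches(...) via UFCS.
bool matches(Token token, string parent, string[] subs...)
{
    if(token.name != parent || token.subX.length != subs.length)
        return false;
    foreach(i, s; subs)
        if(token.subX[i].name != s)
            return false;
    return true;
}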

All of the static-style token types retain all of that for total 
compatibility with the dynamic-style. But the static-style nonterminals 
provide an additional member, "sub", that can be used like this:

void foo(Token_simple!("<List>", "<List>", "+", "<Item>") token)
{
    auto leftSide = token.sub!0;  // the nested <List>
    auto rightSide = token.sub!2; // the <Item>
    //auto dummy = token.sub!10; // Compile-time error

    // Static-style: sub-tokens come back with their precise static types
    static assert(is(typeof(leftSide) == Token_simple!"<List>"));
    static assert(is(typeof(rightSide) == Token_simple!"<Item>"));
}
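
For what it's worth, here's roughly the idea behind such a member, boiled 
down (not the actual Goldie code; it assumes the rule's token type carries 
the static types of its sub-tokens as template arguments):

// Hypothetical stand-in for the dynamic-style base class.
class Token { Token[] subX; /* ...dynamic-style members... */ }

// Hypothetical static-style rule token. SubTypes lists the static types
// of the sub-tokens, in order.
class RuleToken(SubTypes...) : Token
{
    // sub!i: the index is a template argument, so an out-of-range value
    // fails to compile, and the return type is the i-th sub-token's
    // static type.
    auto sub(size_t index)() if(index < SubTypes.length)
    {
        // Safe by construction: subX[index] was created as a
        // SubTypes[index] when this rule was reduced.
        return cast(SubTypes[index]) subX[index];
    }
}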

As for whether this effect can reasonably be accomplished with structs: I 
have no idea; I haven't really looked into it.
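
Off the top of my head, 'alias this' might be the closest substitute for 
the missing inheritance (purely a thought-experiment sketch, not something 
Goldie does or that I've tried):

// Thought-experiment only. 'alias this' gives one level of implicit
// "is-a" conversion, which could cover static-style-to-dynamic-style,
// but only one alias this is allowed per struct, so the deeper levels
// of the hierarchy (rules under their nonterminal) wouldn't map as
// directly.
struct DynToken
{
    string symbol;
    DynToken[] subX;   // dynamic arrays are references, so this is legal
}

struct StaticToken(string grammar, symbols...)
{
    DynToken base;
    alias base this;   // StaticToken!... implicitly converts to DynToken
}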



