Looking for champion - std.lang.d.lex

Sat Oct 23 11:18:56 PDT 2010

On 10/23/10 11:44 CDT, Sean Kelly wrote:
> Andrei Alexandrescu <SeeWebsiteForEmail at erdani.org> wrote:
>> On 10/22/10 16:28 CDT, Sean Kelly wrote:
>>> Andrei Alexandrescu Wrote:
>>>>
>>>> I have in mind the entire implementation of a simple design, but never
>>>> had the time to execute on it. The tokenizer would work like this:
>>>>
>>>> alias Lexer!(
>>>>        "+", "PLUS",
>>>>        "-", "MINUS",
>>>>        "+=", "PLUS_EQ",
>>>>        ...
>>>>        "if", "IF",
>>>>        "else", "ELSE"
>>>>        ...
>>>> ) DLexer;
>>>>
>>>> Such a declaration generates numeric values DLexer.PLUS etc. and
>>>> generates efficient code that extracts a stream of tokens from a
>>>> stream of text. Each token in the token stream has an ID and its
>>>> text.
>>>
>>> What about, say, floating-point literals?  It seems like the first
>>> element of a pair might have to be a regex pattern.
>>
>> Yah, with regard to such regular patterns (strings, comments, numbers,
>> identifiers) there are at least two possibilities that I see:
>>
>> 1. Go the full route of allowing regexen in the definition. This is
>> very hard because you need to generate an efficient (N|D)FA during
>> compilation.
>>
>> 2. Pragmatically allow "fallthrough" routines, i.e. if nothing in the
>> compile-time table matches, just call onUnrecognizedString(). In
>> conjunction with a few simple specialized functions, that makes it
>> very simple to define arbitrarily complex lexers where the bulk of the
>> work (and the most tedious part) is done by the D compiler.
>
> For the second, that may push the work of recognizing some lexical
> elements into the parser. For example, a comment may be defined as /**/,
> which if there is no lexical definition of a comment means that it
> parses as four distinct valid tokens, div mul mul div.

I was thinking comments could be easily caught by simple routines:

alias Lexer!(
        "+", "PLUS",
        "-", "MINUS",
        "+=", "PLUS_EQ",
        ...
        "/*", q{parseNonNestedComment("*/")},
        "/+", q{parseNestedComment("+/")},
        "//", q{parseOneLineComment()},
        ...
        "if", "IF",
        "else", "ELSE",
        ...
) DLexer;

During compilation, the lexer generator recognizes such non-token entries
as code and arranges for them to be called when the corresponding text is
matched. A comprehensive set of such routines would round out a useful
library.
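
As a rough illustration of the idea (a sketch only, not the actual design
or any existing Phobos module), the table-plus-routines approach might look
like the following in present-day D. Everything beyond the names already
used in this thread is an assumption made for the example: the isName and
enumDecl helpers, the input, matched and UNKNOWN members, the tokenize
method, and the choice to make the comment routines members so that the
q{...} entries can call them unchanged. The matching is a naive linear scan,
not the efficient generated code the design calls for.

```d
import std.algorithm.searching : startsWith;
import std.ascii : isAlphaNum, isWhite;
import std.stdio : writeln;
import std.string : indexOf;

// Hypothetical helper: true if s reads as a token name (PLUS, IF, ...)
// rather than as code to run.
bool isName(string s)
{
    foreach (char c; s)
        if (!isAlphaNum(c) && c != '_') return false;
    return s.length > 0;
}

// Hypothetical helper: builds "enum { PLUS, MINUS, ... }" at compile time.
string enumDecl(defs...)()
{
    string s = "enum { ";
    static foreach (i; 0 .. defs.length / 2)
        static if (isName(defs[2 * i + 1]))
            s ~= defs[2 * i + 1] ~ ", ";
    return s ~ "}";
}

struct Lexer(defs...)
{
    static assert(defs.length % 2 == 0, "expected (text, name-or-code) pairs");
    mixin(enumDecl!defs());   // provides Lexer.PLUS, Lexer.MINUS, ...
    enum UNKNOWN = -1;        // assumed id for text the table does not cover

    static struct Token { int id; string text; }

    string input;    // text not yet consumed
    string matched;  // table entry that was just consumed

    void parseNonNestedComment(string close)
    {
        auto i = input.indexOf(close);
        input = i < 0 ? "" : input[i + close.length .. $];
    }

    void parseNestedComment(string close)
    {
        string open = matched;  // the opener is the entry that got us here
        for (size_t depth = 1; depth && input.length; )
        {
            if (input.startsWith(open))       { ++depth; input = input[open.length .. $]; }
            else if (input.startsWith(close)) { --depth; input = input[close.length .. $]; }
            else input = input[1 .. $];
        }
    }

    void parseOneLineComment()
    {
        auto i = input.indexOf('\n');
        input = i < 0 ? "" : input[i + 1 .. $];
    }

    // Fallthrough: a run of letters, digits and '_', else a single char.
    Token onUnrecognizedString()
    {
        size_t n = 0;
        while (n < input.length && (isAlphaNum(input[n]) || input[n] == '_')) ++n;
        if (n == 0) n = 1;
        auto tok = Token(UNKNOWN, input[0 .. n]);
        input = input[n .. $];
        return tok;
    }

    Token[] tokenize(string text)
    {
        input = text;
        Token[] result;
        while (input.length)
        {
            if (isWhite(input[0])) { input = input[1 .. $]; continue; }
            // Longest match over the compile-time table.
            size_t best = size_t.max, bestLen = 0;
            static foreach (i; 0 .. defs.length / 2)
                if (defs[2 * i].length > bestLen && input.startsWith(defs[2 * i]))
                {
                    best = i;
                    bestLen = defs[2 * i].length;
                }
            if (best == size_t.max) { result ~= onUnrecognizedString(); continue; }
            matched = input[0 .. bestLen];
            input = input[bestLen .. $];
            static foreach (i; 0 .. defs.length / 2)
                if (best == i)
                {
                    static if (isName(defs[2 * i + 1]))   // plain token
                        result ~= Token(mixin(defs[2 * i + 1]), matched);
                    else                                  // code entry
                        mixin(defs[2 * i + 1] ~ ";");
                }
        }
        return result;
    }
}

alias DLexer = Lexer!(
    "+", "PLUS",
    "-", "MINUS",
    "+=", "PLUS_EQ",
    "/*", q{parseNonNestedComment("*/")},
    "/+", q{parseNestedComment("+/")},
    "//", q{parseOneLineComment()}
);

void main()
{
    DLexer lexer;
    foreach (t; lexer.tokenize("a += b /* x */ - /+ y /+ z +/ +/ c // end"))
        writeln(t.id, " ", t.text);
}
```

In this sketch an entry whose second string is a plain identifier becomes an
enum member, and anything else is mixed in as a statement that runs once the
first string has been matched and consumed. Keywords such as "if" are left
out of the example table because a bare longest-match scan would also match
them at the start of a longer identifier; a real generator would need to
handle that case.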

Andrei