Looking for champion - std.lang.d.lex
Nick Sabalausky
a at a.a
Sat Oct 23 14:39:31 PDT 2010
"Andrei Alexandrescu" <SeeWebsiteForEmail at erdani.org> wrote in message
news:i9v8vq$2gvh$1 at digitalmars.com...
> On 10/23/10 11:44 CDT, Sean Kelly wrote:
>> Andrei Alexandrescu<SeeWebsiteForEmail at erdani.org> wrote:
>>> On 10/22/10 16:28 CDT, Sean Kelly wrote:
>>>> Andrei Alexandrescu Wrote:
>>>>>
>>>>> I have in mind the entire implementation of a simple design, but
>>>>> never
>>>>> had the time to execute on it. The tokenizer would work like this:
>>>>>
>>>>> alias Lexer!(
>>>>>     "+",    "PLUS",
>>>>>     "-",    "MINUS",
>>>>>     "+=",   "PLUS_EQ",
>>>>>     ...
>>>>>     "if",   "IF",
>>>>>     "else", "ELSE"
>>>>>     ...
>>>>> ) DLexer;
>>>>>
>>>>> Such a declaration generates numeric values DLexer.PLUS etc. and
>>>>> generates an efficient code that extracts a stream of tokens from a
>>>>> stream of text. Each token in the token stream has the ID and the
>>>>> text.
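
For reference, a minimal sketch of how such a Lexer template might be realized in D. All of the names here (Lexer, Token, nextToken, MiniLexer) are illustrative, following the proposal above; this is not an actual implementation, and a real generator would emit a trie or switch rather than the linear scan shown:

```d
import std.algorithm.searching : startsWith;
import std.conv : to;

struct Token { uint id; string text; }

// spec is a flat compile-time list: pattern0, name0, pattern1, name1, ...
template Lexer(spec...)
{
    // Build "enum uint PLUS = 0; enum uint MINUS = 1; ..." from the
    // name column and mix it in; this yields the numeric values
    // (DLexer.PLUS etc.) the post mentions.
    private string generateIds()
    {
        string code;
        foreach (i, s; spec)
            static if (i % 2 == 1)
                code ~= "enum uint " ~ s ~ " = " ~ to!string(i / 2) ~ ";\n";
        return code;
    }
    mixin(generateIds());

    // Greedy longest match over the fixed patterns. Returns false at
    // end of input or when nothing in the table matches.
    static bool nextToken(ref string input, out Token tok)
    {
        size_t bestLen = 0;
        foreach (i, s; spec)
            static if (i % 2 == 0)
                if (input.startsWith(s) && s.length > bestLen)
                {
                    bestLen = s.length;
                    tok = Token(i / 2, s);
                }
        input = input[bestLen .. $];
        return bestLen > 0;
    }
}

alias MiniLexer = Lexer!("+", "PLUS", "-", "MINUS", "+=", "PLUS_EQ");
```

Note that the unrolled foreach over the spec tuple runs at compile time, so the whole token table is fixed before the program ever sees input; the longest-match rule is what makes "+=" win over "+".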
>>>>
>>>> What about, say, floating-point literals? It seems like the first
>>>> element of a pair might have to be a regex pattern.
>>>
>>>
>>> Yah, with regard to such regular patterns (strings, comments, numbers,
>>> identifiers) there are at least two possibilities that I see:
>>>
>>> 1. Go the full route of allowing regexen in the definition. This is
>>> very hard because you need to generate an efficient (N|D)FA during
>>> compilation.
>>>
>>> 2. Pragmatically allow "fallthrough" routines, i.e. if nothing in the
>>> compile-time table matches, just call onUnrecognizedString(). In
>>> conjunction with a few simple specialized functions, that makes it
>>> very simple to define arbitrarily complex lexers where the bulk of the
>>> work (and the most tedious part) is done by the D compiler.
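
A sketch of what such a fallthrough routine (option 2) could look like, handling the regular-ish lexemes by hand so the generator never needs a compile-time regex engine. The names (onUnrecognizedString, the token IDs) are hypothetical, and the caller is assumed to pass non-empty input:

```d
import std.ascii : isAlpha, isAlphaNum, isDigit;

enum : uint { IDENTIFIER, INT_LITERAL, UNKNOWN }  // illustrative IDs
struct Token { uint id; string text; }

// Called when no fixed pattern in the compile-time table matches.
// Recognizes identifiers and integer literals; anything else is
// consumed one character at a time as UNKNOWN.
Token onUnrecognizedString(ref string input)
{
    size_t n = 1;
    if (isAlpha(input[0]) || input[0] == '_')
    {
        while (n < input.length && (isAlphaNum(input[n]) || input[n] == '_'))
            ++n;
        auto tok = Token(IDENTIFIER, input[0 .. n]);
        input = input[n .. $];
        return tok;
    }
    if (isDigit(input[0]))
    {
        while (n < input.length && isDigit(input[n]))
            ++n;
        auto tok = Token(INT_LITERAL, input[0 .. n]);
        input = input[n .. $];
        return tok;
    }
    auto tok = Token(UNKNOWN, input[0 .. 1]);
    input = input[1 .. $];
    return tok;
}
```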
>>
>> For the second, that may push the work of recognizing some lexical
>> elements into the parser. For example, a comment may be defined as /**/,
>> which if there is no lexical definition of a comment means that it
>> parses as four distinct valid tokens, div mul mul div.
>
> I was thinking comments could be easily caught by simple routines:
>
> alias Lexer!(
>     "+",    "PLUS",
>     "-",    "MINUS",
>     "+=",   "PLUS_EQ",
>     ...
>     "/*",   q{parseNonNestedComment("*/")},
>     "/+",   q{parseNestedComment("+/")},
>     "//",   q{parseOneLineComment()},
>     ...
>     "if",   "IF",
>     "else", "ELSE",
>     ...
> ) DLexer;
>
> During compilation, such non-tokens are recognized as code by the lexer
> generator and called appropriately. A comprehensive set of such routines
> completes a useful library.
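
One of those routines could be sketched roughly as follows; skipNestedComment is a hypothetical name, and the function assumes the generated lexer has already consumed the opening "/+" before calling it:

```d
import std.algorithm.searching : startsWith;

// Skip a nested /+ ... +/ comment, tracking nesting depth, and return
// the text remaining after the matching "+/". Throws if the comment
// is never closed.
string skipNestedComment(string input)
{
    int depth = 1;
    while (input.length >= 2)
    {
        if (input.startsWith("/+"))
        {
            ++depth;
            input = input[2 .. $];
        }
        else if (input.startsWith("+/"))
        {
            --depth;
            input = input[2 .. $];
            if (depth == 0)
                return input;
        }
        else
            input = input[1 .. $];
    }
    throw new Exception("unterminated /+ +/ comment");
}
```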
>
What's wrong with regexes? They're pretty typical for lexers.