Looking for champion - std.lang.d.lex
Sean Kelly
sean at invisibleduck.org
Sat Oct 23 09:44:48 PDT 2010
Andrei Alexandrescu <SeeWebsiteForEmail at erdani.org> wrote:
> On 10/22/10 16:28 CDT, Sean Kelly wrote:
>> Andrei Alexandrescu Wrote:
>>>
>>> I have in mind the entire implementation of a simple design, but
>>> never
>>> had the time to execute on it. The tokenizer would work like this:
>>>
>>> alias Lexer!(
>>>     "+",    "PLUS",
>>>     "-",    "MINUS",
>>>     "+=",   "PLUS_EQ",
>>>     ...
>>>     "if",   "IF",
>>>     "else", "ELSE"
>>>     ...
>>> ) DLexer;
>>>
>>> Such a declaration generates numeric values DLexer.PLUS etc. and
>>> generates an efficient code that extracts a stream of tokens from a
>>> stream of text. Each token in the token stream has the ID and the
>>> text.
>>
>> What about, say, floating-point literals? It seems like the first
>> element of a pair might have to be a regex pattern.
>
>
> Yah, with regard to such regular patterns (strings, comments, numbers,
> identifiers) there are at least two possibilities that I see:
>
> 1. Go the full route of allowing regexen in the definition. This is
> very hard because you need to generate an efficient (N|D)FA during
> compilation.
>
> 2. Pragmatically allow "fallthrough" routines, i.e. if nothing in the
> compile-time table matches, just call onUnrecognizedString(). In
> conjunction with a few simple specialized functions, that makes it
> very simple to define arbitrarily complex lexers where the bulk of the
> work (and the most tedious part) is done by the D compiler.
For the second, that may push the work of recognizing some lexical
elements into the parser. Take a comment written as /**/, for example:
if the lexer has no definition of a comment, it lexes as four distinct
valid tokens, div mul mul div.
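To make the pitfall concrete, here is a minimal sketch in D of the fallthrough idea from point 2. The fixed-string table handles "/" and "*", and a hand-written routine is invoked for anything the table would mis-lex, such as a /* */ comment. All names here (TokenId, lexComment, nextToken) are illustrative assumptions, not the proposed Lexer API.

```d
import std.algorithm.searching : startsWith;
import std.string : indexOf;

enum TokenId { DIV, MUL, COMMENT, EOF }

struct Token { TokenId id; string text; }

// Hypothetical hand-written fallthrough for /* */ comments; the
// compile-time table alone cannot recognize them.
Token lexComment(ref string s)
{
    auto end = s.indexOf("*/", 2);
    assert(end >= 0, "unterminated comment");
    auto tok = Token(TokenId.COMMENT, s[0 .. end + 2]);
    s = s[end + 2 .. $];
    return tok;
}

Token nextToken(ref string s)
{
    if (s.length == 0) return Token(TokenId.EOF, "");
    // The comment check must run before the fixed-string matches;
    // otherwise "/**/" lexes as DIV MUL MUL DIV.
    if (s.startsWith("/*")) return lexComment(s);
    if (s.startsWith("/")) { s = s[1 .. $]; return Token(TokenId.DIV, "/"); }
    if (s.startsWith("*")) { s = s[1 .. $]; return Token(TokenId.MUL, "*"); }
    assert(0, "unrecognized input");
}

void main()
{
    string src = "/**/";
    auto t = nextToken(src);
    assert(t.id == TokenId.COMMENT && t.text == "/**/");
}
```

Drop the comment check and the same input yields DIV MUL MUL DIV, pushing the problem into the parser exactly as described above.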