Looking for champion - std.lang.d.lex

Sat Oct 23 12:46:22 PDT 2010

Sean Kelly <sean at invisibleduck.org> wrote:
> Sean Kelly <sean at invisibleduck.org> wrote:
>> Andrei Alexandrescu <SeeWebsiteForEmail at erdani.org> wrote:
>>> On 10/22/10 16:28 CDT, Sean Kelly wrote:
>>>> Andrei Alexandrescu Wrote:
>>>>> 
>>>>> I have in mind the entire implementation of a simple design, but
>>>>> never had the time to execute on it. The tokenizer would work like this:
>>>>> 
>>>>> alias Lexer!(
>>>>>       "+", "PLUS",
>>>>>       "-", "MINUS",
>>>>>       "+=", "PLUS_EQ",
>>>>>       ...
>>>>>       "if", "IF",
>>>>>       "else", "ELSE"
>>>>>       ...
>>>>> ) DLexer;
>>>>> 
>>>>> Such a declaration generates numeric values DLexer.PLUS etc. and
>>>>> generates efficient code that extracts a stream of tokens from a
>>>>> stream of text. Each token in the token stream has the ID and the
>>>>> text.
>>>> 
>>>> What about, say, floating-point literals?  It seems like the first
>>>> element of a pair might have to be a regex pattern.
>>> 
>>> 
>>> Yah, with regard to such regular patterns (strings, comments, numbers,
>>> identifiers) there are at least two possibilities that I see:
>>> 
>>> 1. Go the full route of allowing regexen in the definition. This is
>>> very hard because you need to generate an efficient (N|D)FA during
>>> compilation.
>>> 
>>> 2. Pragmatically allow "fallthrough" routines, i.e. if nothing in the
>>> compile-time table matches, just call onUnrecognizedString(). In
>>> conjunction with a few simple specialized functions, that makes it
>>> very simple to define arbitrarily complex lexers where the bulk of the
>>> work (and the most tedious part) is done by the D compiler.
>> 
>> For the second, that may push the work of recognizing some lexical
>> elements into the parser. For example, a comment may be defined as /**/,
>> which if there is no lexical definition of a comment means that it
>> parses as four distinct valid tokens, div mul mul div.
> 
> Or maybe not. A /* could be CommentBegin. I'll have to think on it a bit
> more.

I still think it won't work. The stuff inside the comment would come
through as a string of random tokens. Also, the // comment is EOL
sensitive, and this info isn't normally communicated to the parser.
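
To make the concern concrete, here is a minimal runtime sketch of a
fixed-string, longest-match tokenizer with a fallthrough case. It is not the
compile-time Lexer!(...) template proposed above, which was never shown in
this thread. The names Rule, rules, lex and the "UNRECOGNIZED" id are all
hypothetical, and the sketch assumes ASCII input with spaces as the only
whitespace.

```d
import std.algorithm.searching : startsWith;

// A token carries an ID and the matched text, as in the proposal.
struct Token { string id; string text; }

// Hypothetical stand-in for the ("text", "ID") pairs given to Lexer!(...).
struct Rule { string text; string id; }

immutable Rule[] rules = [
    Rule("+", "PLUS"), Rule("-", "MINUS"), Rule("+=", "PLUS_EQ"),
    Rule("/", "DIV"), Rule("*", "MUL"),
];

Token[] lex(string src)
{
    Token[] result;
    while (src.length)
    {
        // Skip spaces (assumption: no other whitespace handling).
        if (src[0] == ' ') { src = src[1 .. $]; continue; }

        // Longest match over the fixed-string table.
        string bestText, bestId;
        foreach (rule; rules)
        {
            if (src.startsWith(rule.text) && rule.text.length > bestText.length)
            {
                bestText = rule.text;
                bestId = rule.id;
            }
        }

        // Fallthrough: nothing in the table matched, so emit one
        // character as UNRECOGNIZED (assumes single-byte characters).
        if (bestText.length == 0)
        {
            bestText = src[0 .. 1];
            bestId = "UNRECOGNIZED";
        }

        result ~= Token(bestId, bestText);
        src = src[bestText.length .. $];
    }
    return result;
}
```

With no lexical rule for comments, lex("/* a+=b */") yields seven tokens:
DIV, MUL, UNRECOGNIZED, PLUS_EQ, UNRECOGNIZED, MUL, DIV. The comment body
arrives as ordinary tokens, which is the problem described above.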