Looking for champion - std.lang.d.lex

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Fri Oct 22 12:48:49 PDT 2010


On 10/22/10 14:02 CDT, Tomek Sowiński wrote:
> On 22-10-2010 at 00:01:21, Walter Bright <newshound2 at digitalmars.com>
> wrote:
>
>> As we all know, tool support is important for D's success. Making
>> tools easier to build will help with that.
>>
>> To that end, I think we need a lexer for the standard library -
>> std.lang.d.lex. It would be helpful in writing color syntax
>> highlighting filters, pretty printers, repl, doc generators, static
>> analyzers, and even D compilers.
>>
>> It should:
>>
>> 1. support a range interface for its input, and a range interface for
>> its output
>> 2. optionally not generate lexical errors, but just try to recover and
>> continue
>> 3. optionally return comments and ddoc comments as tokens
>> 4. the tokens should be a value type, not a reference type
>> 5. generally follow along with the C++ one so that they can be
>> maintained in tandem
>>
>> It can also serve as the basis for creating a javascript
>> implementation that can be embedded into web pages for syntax
>> highlighting, and eventually an std.lang.d.parse.
>>
>> Anyone want to own this?
>
> Interesting idea. Here's another: D will soon need bindings for CORBA,
> Thrift, etc., so lexers will have to be written over and over to grok
> interface files. Perhaps a generic tokenizer that can be parametrized
> with a lexical grammar would bring more ROI. I have a hunch D's templates
> are strong enough to pull this off without any source code generation
> à la JavaCC. The books I read on compilers say tokenization is a solved
> problem, so the theory on what a good abstraction should be is
> done. What do you think?

Yes. IMHO writing a D tokenizer is a wasted effort. We need a tokenizer 
generator.

I have in mind the entire implementation of a simple design, but never 
had the time to execute on it. The tokenizer would work like this:

alias Lexer!(
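     // each pair: a lexeme string followed by the name of its token ID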
     "+", "PLUS",
     "-", "MINUS",
     "+=", "PLUS_EQ",
     ...
     "if", "IF",
     "else", "ELSE"
     ...
) DLexer;

Such a declaration generates numeric values DLexer.PLUS etc. and 
efficient code that extracts a stream of tokens from a stream of text. 
Each token in the token stream carries its ID and its text.
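
To make that concrete, here is a rough sketch of the consumer side. The 
Token struct, the TokenId enum, and the highlight function below are 
illustrative stand-ins assumed for the example, not part of any actual 
design:

import std.stdio;

// Illustrative stand-ins; the real generated lexer would define these.
enum TokenId : uint { PLUS, MINUS, IF, ELSE, IDENTIFIER }

struct Token
{
    TokenId id;   // which kind of token this is, e.g. TokenId.PLUS
    string  text; // the slice of source text that formed the token
}

// A consumer only needs an input range of Token values; for example,
// a toy syntax highlighter that bolds keywords.
void highlight(R)(R tokens)
{
    foreach (tok; tokens)
    {
        if (tok.id == TokenId.IF || tok.id == TokenId.ELSE)
            writef("<b>%s</b>", tok.text);
        else
            write(tok.text);
    }
}

void main()
{
    // Stand-in token stream; a real DLexer would produce this from text.
    auto toks = [Token(TokenId.IF, "if"), Token(TokenId.IDENTIFIER, "x"),
                 Token(TokenId.PLUS, "+"), Token(TokenId.IDENTIFIER, "y")];
    highlight(toks);
    writeln();
}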

Comments, strings, etc. can be handled in one of several ways, but that's 
a longer discussion.

The undertaking is doable but nontrivial.
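
For what it's worth, here is a minimal sketch of the compile-time half, 
purely as an illustration of feasibility; the Lexer template, its makeIds 
helper, and the anonymous-enum trick are assumptions for the example, not 
a proposed Phobos design:

// A minimal sketch, not a real design: turn the (lexeme, name) pairs
// into an anonymous enum at compile time via a string mixin.
template Lexer(spec...)
{
    static string makeIds()
    {
        string s = "enum : uint { ";
        foreach (i, item; spec)
            if (i % 2 == 1)      // odd positions hold the token names
                s ~= item ~ ", ";
        return s ~ "}";
    }
    mixin(makeIds());            // introduces PLUS, MINUS, IF, ELSE, ...
    // The actual token-extraction routine (e.g. a trie or switch built
    // from the lexemes at even positions) would be generated here too.
}

alias Lexer!("+", "PLUS", "-", "MINUS", "if", "IF", "else", "ELSE") DLexer;

static assert(DLexer.PLUS == 0 && DLexer.IF == 2);

Generating the actual extraction code (maximal munch over the lexemes, 
keywords vs. identifiers, and so on) is where the nontrivial part lies.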


Andrei

