std.d.lexer requirements

Walter Bright newshound2 at digitalmars.com
Wed Aug 1 17:10:07 PDT 2012


Given the various proposals for a lexer module for Phobos, I thought I'd share 
some characteristics it ought to have.

First of all, it should be suitable for, at a minimum:

1. compilers

2. syntax highlighting editors

3. source code formatters

4. HTML creation

To that end:

1. It should accept as input an input range of UTF8. I feel it is a mistake to 
templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16 should use 
an 'adapter' range to convert the input to UTF8. (This is what component 
programming is all about.)

2. It should output an input range of tokens

3. Tokens should be values, not classes

4. It should avoid memory allocation as much as possible

5. It should not read or write any mutable global state outside of its "Lexer"
instance

6. A single "Lexer" instance should be able to serially accept input ranges, 
sharing and updating one identifier table

7. It should accept a callback delegate for errors. That delegate should decide 
whether to:
    1. ignore the error (and "Lexer" will try to recover and continue)
    2. print an error message (and "Lexer" will try to recover and continue)
    3. throw an exception (and "Lexer" is done with that input range)

8. Lexer should be configurable as to whether it should collect information 
about comments and ddoc comments or not

9. Comments and ddoc comments should be attached to the next following token, 
they should not themselves be tokens

10. High speed matters a lot

11. Tokens should have begin/end line/column markers, though most of the time 
this can be implicitly determined

12. It should come with unittests that, using -cov, show 100% coverage
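
To make points 2, 3, 7 and 11 concrete, here is a rough sketch of the shape I
have in mind. It is a toy, not a proposed API: the names ToyLexer, Token and
TokenKind are invented for illustration, it only understands ASCII identifiers,
integers and a few symbols, and it takes a string rather than an arbitrary
input range so that token text can be a slice of the source with no allocation.
Columns are counted in code units.

```d
import std.ascii : isAlpha, isAlphaNum, isDigit, isWhite;
import std.range.primitives : isInputRange;

enum TokenKind { identifier, number, symbol }

// A token is a plain value (struct), not a class.
struct Token
{
    TokenKind kind;
    string text;          // slice of the source, so nothing is allocated
    size_t line, column;  // 1-based start position
}

struct ToyLexer
{
    private string src;
    private size_t pos;
    private size_t line = 1, column = 1;
    private Token current;
    private bool done;
    // The callback decides: return quietly (ignore), print, or throw.
    private void delegate(string msg, size_t line, size_t column) onError;

    this(string src, void delegate(string, size_t, size_t) onError)
    {
        this.src = src;
        this.onError = onError;
        popFront(); // prime the first token
    }

    @property bool empty() { return done; }
    @property Token front() { return current; }

    void popFront()
    {
        for (;;)
        {
            while (pos < src.length && isWhite(src[pos]))
                advance();
            if (pos == src.length) { done = true; return; }

            immutable start = pos, l = line, c = column;
            immutable char ch = src[pos];
            TokenKind kind;
            if (isAlpha(ch) || ch == '_')
            {
                kind = TokenKind.identifier;
                while (pos < src.length && (isAlphaNum(src[pos]) || src[pos] == '_'))
                    advance();
            }
            else if (isDigit(ch))
            {
                kind = TokenKind.number;
                while (pos < src.length && isDigit(src[pos]))
                    advance();
            }
            else if (isSymbol(ch))
            {
                kind = TokenKind.symbol;
                advance();
            }
            else
            {
                advance();                              // recovery: skip the character
                onError("unexpected character", l, c);  // may throw
                continue;
            }
            current = Token(kind, src[start .. pos], l, c);
            return;
        }
    }

    private void advance()
    {
        if (src[pos] == '\n') { ++line; column = 1; }
        else ++column;
        ++pos;
    }

    private static bool isSymbol(char c)
    {
        switch (c)
        {
            case '+', '-', '*', '/', '=', ';', '(', ')': return true;
            default: return false;
        }
    }
}

static assert(isInputRange!ToyLexer);

void main()
{
    // A callback that just counts: the lexer recovers and continues.
    size_t errors;
    Token[] toks;
    foreach (tok; ToyLexer("x1 = 42 $ y",
            delegate void(string msg, size_t l, size_t c) { ++errors; }))
        toks ~= tok;
    assert(errors == 1 && toks.length == 4);
    assert(toks[2] == Token(TokenKind.number, "42", 1, 6));

    // A callback that throws: the lexer is done with that input.
    bool threw;
    try
    {
        foreach (tok; ToyLexer("a $ b",
                delegate void(string msg, size_t l, size_t c) { throw new Exception(msg); })) {}
    }
    catch (Exception e) { threw = true; }
    assert(threw);
}
```

The only allocation above is the test's own array of tokens. A real lexer would
also need the identifier table, comment attachment and end positions described
in the list; those are left out here to keep the sketch short.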

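As for the 'adapter' range in point 1, a small illustration, assuming a Phobos
that has std.utf.byChar (current releases do): it lazily re-encodes UTF16 input
as UTF8 code units, so the lexer itself never needs to know about UTF16.

```d
import std.algorithm.comparison : equal;
import std.range.primitives : walkLength;
import std.utf : byChar;

void main()
{
    wstring utf16 = "caf\u00E9 = 1;"w;  // UTF16 source text
    auto utf8 = utf16.byChar;           // lazy range of UTF8 code units

    // Same text, now as UTF8 code units, without building a new string.
    assert(equal(utf8, "caf\u00E9 = 1;".byChar));

    // U+00E9 is one UTF16 code unit but two UTF8 code units.
    assert("\u00E9"w.length == 1);
    assert("\u00E9"w.byChar.walkLength == 2);
}
```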

Basically, I don't want anyone to be motivated to do a separate one after seeing 
this one.

