std.d.lexer requirements

Sun Aug 5 00:59:40 PDT 2012

To help with performance comparisons, I ripped dmd's lexer out and got it building as a few .d files.  It's very crude:
it has tons of casts (more than the original C++ version), and I made no cleanup or other changes beyond the minimum
needed to get it to build and run.  Obviously there's plenty of room for cleanup, but that's not the point; it's just
useful as a baseline.

The branch:
    https://github.com/braddr/phobos/tree/dmd_lexer

The commit with the changes:
    https://github.com/braddr/phobos/commit/040540ef3baa38997b15a56be3e9cd9c4bfa51ab

On my desktop (far from idle; it's running 2 of the auto testers), it consistently takes 0.187s to lex all of the .d
files in Phobos.
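
For anyone who wants to take a comparable measurement against another lexer, a timing harness could look roughly like
the sketch below.  This is not the code in the branch.  lexBuffer is a hypothetical stand-in that only counts ASCII
space and newline bytes, to be replaced by a call into the lexer under test.  Reading the files before starting the
clock is an assumption about what should be timed, not a description of how the 0.187s figure was obtained.

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.file : dirEntries, read, SpanMode;
import std.stdio : writefln;

// Hypothetical stand-in for the lexer under test: it only counts space and
// newline bytes. Replace the body with a call into a real lexer.
size_t lexBuffer(const(ubyte)[] source)
{
    size_t count;
    foreach (b; source)
        if (b == ' ' || b == '\n')
            ++count;
    return count;
}

void main(string[] args)
{
    // Directory to scan; defaults to the current directory.
    string root = args.length > 1 ? args[1] : ".";

    // Load every .d file up front so file I/O stays outside the timed region.
    const(ubyte)[][] sources;
    foreach (entry; dirEntries(root, "*.d", SpanMode.depth))
        if (entry.isFile)
            sources ~= cast(const(ubyte)[]) read(entry.name);

    auto sw = StopWatch(AutoStart.yes);
    size_t total;
    foreach (src; sources)
        total += lexBuffer(src);
    sw.stop();

    writefln("%s files, %s units counted, %s ms",
        sources.length, total, sw.peek.total!"msecs");
}
```

Running it a few times and comparing the spread matters more than any single number, especially on a busy machine.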

Later,
Brad

On 8/1/2012 5:10 PM, Walter Bright wrote:
> Given the various proposals for a lexer module for Phobos, I thought I'd share some characteristics it ought to have.
> 
> First of all, it should be suitable for, at a minimum:
> 
> 1. compilers
> 
> 2. syntax highlighting editors
> 
> 3. source code formatters
> 
> 4. html creation
> 
> To that end:
> 
> 1. It should accept as input an input range of UTF8. I feel it is a mistake to templatize it for UTF16 and UTF32. Anyone
> desiring to feed it UTF16 should use an 'adapter' range to convert the input to UTF8. (This is what component
> programming is all about.)
> 
> 2. It should output an input range of tokens
> 
> 3. tokens should be values, not classes
> 
> 4. It should avoid memory allocation as much as possible
> 
> 5. It should not read or write any mutable global state outside of its "Lexer"
> instance
> 
> 6. A single "Lexer" instance should be able to serially accept input ranges, sharing and updating one identifier table
> 
> 7. It should accept a callback delegate for errors. That delegate should decide whether to:
>    1. ignore the error (and "Lexer" will try to recover and continue)
>    2. print an error message (and "Lexer" will try to recover and continue)
>    3. throw an exception (and "Lexer" is done with that input range)
> 
> 8. Lexer should be configurable as to whether it should collect information about comments and ddoc comments or not
> 
> 9. Comments and ddoc comments should be attached to the next following token, they should not themselves be tokens
> 
> 10. High speed matters a lot
> 
> 11. Tokens should have begin/end line/column markers, though most of the time this can be implicitly determined
> 
> 12. It should come with unittests that, using -cov, show 100% coverage
> 
> 
> Basically, I don't want anyone to be motivated to do a separate one after seeing this one.
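
As a rough illustration of how several of the quoted requirements could fit together, here is a minimal D sketch.  It
is neither the dmd lexer from the branch above nor a proposed std.d.lexer API: every name in it (TokenKind, Token,
ErrorHandler, Lexer, lexer) is made up for the example.  It touches only items 1-3, 7-9 and 11, records just the begin
position of each token, handles only ASCII identifiers, integers, single punctuation characters and // comments, and
allocates freely, so it makes no attempt at items 4 and 10.

```d
import std.ascii : isAlpha, isAlphaNum, isDigit, isWhite;
import std.range.primitives : ElementType, isInputRange;
import std.traits : Unqual;

// Every name here is hypothetical; none of this is an existing Phobos API.
enum TokenKind { identifier, integer, other }

// Item 3: a token is a plain value type.
struct Token
{
    TokenKind kind;
    string text;
    string comment;    // item 9: comments seen just before this token
    uint line, column; // item 11: where the token begins (1-based)
}

// Item 7: returning means "recover and continue"; throwing ends the lexing.
alias ErrorHandler = void delegate(string message, uint line, uint column);

// Item 1: the input is any input range of UTF-8 code units (char).
struct Lexer(R)
if (isInputRange!R && is(Unqual!(ElementType!R) == char))
{
    private R input;
    private ErrorHandler onError;
    private bool collectComments; // item 8
    private Token current;
    private string pendingComment;
    private uint line = 1, column = 1;
    private bool finished;

    this(R input, ErrorHandler onError, bool collectComments = true)
    {
        this.input = input;
        this.onError = onError;
        this.collectComments = collectComments;
        popFront(); // scan the first token
    }

    // Item 2: the lexer is itself an input range of Token.
    @property bool empty() { return finished; }
    @property Token front() { return current; }

    void popFront()
    {
        while (!input.empty)
        {
            const char c = input.front;
            const uint startLine = line, startColumn = column;
            string text;
            if (isWhite(c)) { advance(); continue; }
            if (isAlpha(c) || c == '_')
            {
                while (!input.empty && (isAlphaNum(input.front) || input.front == '_'))
                    text ~= advance();
                emit(TokenKind.identifier, text, startLine, startColumn);
                return;
            }
            if (isDigit(c))
            {
                while (!input.empty && isDigit(input.front))
                    text ~= advance();
                emit(TokenKind.integer, text, startLine, startColumn);
                return;
            }
            text ~= advance();
            if (c == '/' && !input.empty && input.front == '/')
            {
                // Line comment: keep it for the next token instead of emitting it.
                while (!input.empty && input.front != '\n')
                    text ~= advance();
                if (collectComments)
                    pendingComment = pendingComment.length ? pendingComment ~ "\n" ~ text : text;
                continue;
            }
            if (c < 0x20 || c >= 0x7F)
            {
                if (onError !is null)
                    onError("unexpected byte", startLine, startColumn);
                continue; // recover by skipping the byte
            }
            emit(TokenKind.other, text, startLine, startColumn);
            return;
        }
        finished = true;
    }

    // Consumes one code unit and keeps the line/column counters in step.
    private char advance()
    {
        const char c = input.front;
        input.popFront();
        if (c == '\n') { ++line; column = 1; } else ++column;
        return c;
    }

    private void emit(TokenKind kind, string text, uint atLine, uint atColumn)
    {
        current = Token(kind, text, pendingComment, atLine, atColumn);
        pendingComment = null;
    }
}

// Convenience constructor so callers do not have to spell out Lexer!R.
auto lexer(R)(R input, ErrorHandler onError = null, bool collectComments = true)
{
    return Lexer!R(input, onError, collectComments);
}

void main()
{
    import std.stdio : writeln;
    import std.utf : byChar;

    // byChar makes a string present char code units rather than decoded dchars.
    auto tokens = lexer("// doc\nfoo 42".byChar,
        delegate(string msg, uint l, uint c) { writeln(l, ":", c, ": ", msg); });
    foreach (t; tokens)
        writeln(t.line, ":", t.column, " ", t.kind, " ", t.text, " [", t.comment, "]");
}
```

The shared identifier table from item 6 is left out.  In a design along these lines it would presumably live in the
Lexer instance and survive across inputs, but that is a guess at one possible shape, not something the list above
specifies.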