Request for comments: std.d.lexer

Sun Jan 27 02:17:28 PST 2013

On Sun, Jan 27, 2013 at 10:51 AM, Brian Schott <briancschott at gmail.com> wrote:
> I'm writing a D lexer for possible inclusion in Phobos.
>
> DDOC: http://hackerpilot.github.com/experimental/std_lexer/phobos/lexer.html
> Code:
> https://github.com/Hackerpilot/Dscanner/blob/range-based-lexer/std/d/lexer.d

Cool! I remember linking to it on the wiki a week ago, here:

http://wiki.dlang.org/Lexers_Parsers

Feel free to correct the entry.

> It's currently able to correctly syntax highlight all of Phobos, but does a
> fairly bad job at rejecting or notifying users/callers about invalid input.
>
> I'd like to hear arguments on the various ways to handle errors in the
> lexer. In a compiler it would be useful to throw an exception on finding
> something like a string literal that doesn't stop before EOF, but a text
> editor or IDE would probably want to be a bit more lenient. Maybe having it
> run-time (or compile-time configurable) like std.csv would be the best
> option here.

Last time we discussed it, IIRC, some people wanted the lexer to stop
at once, others just wanted an Error token.
I personally prefer an Error token, but that means finding a way to
start lexing again after the error (and hence, finding where the error
ends).
I guess any separator/terminator could be used to re-engage the lexer:
space, semicolon, closing brace, closing parenthesis?
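
To make the Error-token idea concrete, here is a minimal sketch of a toy
lexer that emits an error token and resumes at the next separator. Everything
in it (Kind, Tok, isResync, lex) is a hypothetical name made up for
illustration, not part of std.d.lexer, and the set of resync characters is
just the one suggested above.

```d
import std.stdio : writeln;

// Hypothetical token kinds; not taken from std.d.lexer.
enum Kind { word, separator, error }

struct Tok
{
    Kind kind;
    string text;
}

// Characters at which lexing may resume after an error (an assumption:
// space, semicolon, closing brace, closing parenthesis).
bool isResync(char c)
{
    return c == ' ' || c == ';' || c == '}' || c == ')';
}

bool isWordChar(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
        || (c >= '0' && c <= '9') || c == '_';
}

Tok[] lex(string src)
{
    Tok[] toks;
    size_t i = 0;
    while (i < src.length)
    {
        immutable start = i;
        immutable c = src[i];
        if (c == ' ')
        {
            ++i; // skip whitespace
        }
        else if (isWordChar(c))
        {
            while (i < src.length && isWordChar(src[i])) ++i;
            toks ~= Tok(Kind.word, src[start .. i]);
        }
        else if (c == ';' || c == '}' || c == ')')
        {
            ++i;
            toks ~= Tok(Kind.separator, src[start .. i]);
        }
        else
        {
            // Unknown input: swallow everything up to the next resync
            // character, emit it as one error token, then carry on.
            ++i;
            while (i < src.length && !isResync(src[i])) ++i;
            toks ~= Tok(Kind.error, src[start .. i]);
        }
    }
    return toks;
}

void main()
{
    foreach (t; lex("foo @#$bar; baz"))
        writeln(t.kind, " '", t.text, "'");
}
```

The error token ends just before the separator, so the separator itself is
still lexed normally. A real D lexer would need more care, for instance with
an unterminated string literal, where the next semicolon may well be part of
the literal.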

> I'm interested in ideas on the API design and other high-level issues at the
> moment. I don't consider this ready for inclusion. (The current module being
> reviewed for inclusion in Phobos is the new std.uni.)

OK, here are a few questions:

* Having a range interface is good. Any reason why you made byToken a
class and not a struct? Most (like, 99%) of the ranges in Phobos are
structs. Do you need reference semantics?

* Also, is there a way to keep comments? Any tool wanting to modify
the code might need them.
(edit: Ah, I see it: IterationStyle.IncludeComments)

* I'd distinguish between standard comments and documentation
comments. These are different beasts, to my eyes.

* I see Token has a startIndex member. Any reason not to have an
endIndex member? Or can an end index always be deduced from
startIndex and value.length?

* How does it fare with non-ASCII code?

* A rough estimate of the number of tokens/s would be good (I know it'll
vary). Walter seems to think that if a lexer is not able to vomit thousands
of tokens a second, then it's not good. On a related note, does your
lexer have any problem with 10k+-line files?
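
To illustrate the startIndex/endIndex and non-ASCII questions together, here
is a small sketch. The Token layout below is my assumption, not the actual
std.d.lexer type: I assume value is an unmodified UTF-8 slice of the source
and startIndex counts code units (bytes). Under those assumptions the end
index can be derived, even when the token contains non-ASCII characters.

```d
import std.stdio : writeln;
import std.range.primitives : walkLength;

// Hypothetical token layout; the real std.d.lexer Token may differ.
struct Token
{
    string value;      // UTF-8 slice of the source
    size_t startIndex; // offset in code units (bytes), by assumption

    // Derived rather than stored; valid only if value is an unmodified
    // slice of the source text.
    size_t endIndex() const { return startIndex + value.length; }
}

void main()
{
    // \u00E9 is e-acute, which takes two UTF-8 code units.
    string src = "auto \u00E9 = 1;";
    auto t = Token(src[5 .. 7], 5);

    assert(t.value == "\u00E9");
    assert(t.endIndex == 7);
    assert(src[t.startIndex .. t.endIndex] == t.value);

    // Code units vs. code points: the two counts differ for non-ASCII text.
    writeln(t.value.length, " ", t.value.walkLength);
}
```

If value is ever something other than the raw slice (say, a string literal
with its escapes already processed, or an empty value for operators), then
startIndex + value.length no longer gives the end position and a stored
endIndex would be needed.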