std.d.lexer : voting thread
deadalnix
deadalnix at gmail.com
Fri Oct 4 21:24:28 PDT 2013
On Saturday, 5 October 2013 at 00:24:22 UTC, Andrei Alexandrescu
wrote:
> Vision
> ======
>
> I'd been following the related discussions for a while, but I
> made up my mind today as I was working on a C++ lexer. The
> C++ lexer is for Facebook's internal linter. I'm
> translating the lexer from C++.
>
> Before long I realized two simple things. First, I can't reuse
> anything from Brian's code (without copying it and doing
> surgery on it), although it is extremely similar to what I'm
> doing.
>
> Second, I figured that it is almost trivial to implement a
> simple, generic, and reusable (across languages and tasks)
> static trie searcher that takes a compile-time array with all
> tokens and keywords and returns the token at the front of a
> range with minimum comparisons.
>
> Such a trie searcher is not intelligent, but is very composable
> and extremely fast. It is just smart enough to do maximum munch
> (e.g. interprets "==" and "foreach" as one token each, not
> two), but is not smart enough to distinguish an identifier
> "whileTrue" from the keyword "while" (it claims "while" was
> found and stops right at the beginning of "True" in the
> stream). This is for generality, so applications can define how
> identifiers work (e.g. Lisp allows "-" in identifiers but D
> doesn't, etc.). The trie finder doesn't do numbers or comments
> either. No regexen of any kind.
>
> The beauty of it all is that all of these more involved bits
> (many of which are language specific) can be implemented
> modularly and trivially as a postprocessing step after the trie
> finder. For example the user specifies "/*" as a token to the
> trie finder. Whenever a comment starts, the trie finder will
> find and return it; then the user implements the alternate
> grammar of multiline comments.
>
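The maximum-munch trie finder described in the quote above can be sketched as follows. This is a hypothetical runtime analogue in C++ (the language the lexer is being translated from); Andrei's version is a compile-time static trie, and the `Trie` class and its method names here are illustration only:

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// A tiny runtime trie over a token/keyword table. It does maximum munch
// ("==" beats "="), but deliberately knows nothing about identifiers,
// numbers, or comments: given "whileTrue" it reports "while" and stops.
struct Trie {
    std::map<char, std::unique_ptr<Trie>> next;
    bool terminal = false;  // true if a token ends at this node

    void insert(const std::string& tok) {
        Trie* node = this;
        for (char c : tok) {
            auto& child = node->next[c];
            if (!child) child = std::make_unique<Trie>();
            node = child.get();
        }
        node->terminal = true;
    }

    // Longest token that is a prefix of `input` starting at `pos`,
    // or "" if nothing in the table matches.
    std::string match(const std::string& input, std::size_t pos = 0) const {
        const Trie* node = this;
        std::size_t best = 0, len = 0;
        for (std::size_t i = pos; i < input.size(); ++i) {
            auto it = node->next.find(input[i]);
            if (it == node->next.end()) break;
            node = it->second.get();
            ++len;
            if (node->terminal) best = len;  // remember longest match so far
        }
        return input.substr(pos, best);
    }
};
```

With a table of {"=", "==", "while", "/*"}, matching "whileTrue" yields "while" (the caller decides it is really an identifier), and matching a comment yields "/*", at which point the caller takes over with the comment grammar, exactly as the postprocessing step in the quote describes.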
That is more or less how SDC's lexer works. You pass it two AAs
(one mapping strings to token types, and one mapping strings to
the names of functions that return the actual token, for
instance to handle /*), and finally a fallback for when nothing
matches.
A giant 3-headed monster mixin is generated from these data.
That has been really handy so far.
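A runtime analogue of that two-AAs-plus-fallback design might look like the sketch below. This is not SDC's code (SDC mixes these tables into generated D at compile time); the C++ names and types here are hypothetical, chosen only to show the three inputs and the dispatch between them:

```cpp
#include <cctype>
#include <cstddef>
#include <functional>
#include <map>
#include <string>

enum class TokenType { Assign, Equal, Comment, Identifier };

// Hypothetical runtime stand-in for SDC's three lexer inputs.
struct Lexer {
    // 1) strings mapped directly to token types
    std::map<std::string, TokenType> simple;
    // 2) strings mapped to handlers that produce the actual token
    //    (e.g. "/*" hands off to a comment scanner)
    std::map<std::string,
             std::function<TokenType(const std::string&, std::size_t&)>> handlers;
    // 3) fallback for when nothing matches
    std::function<TokenType(const std::string&, std::size_t&)> fallback;

    TokenType next(const std::string& src, std::size_t& pos) {
        // Longest-prefix search over both tables (maximum munch).
        for (std::size_t len = src.size() - pos; len > 0; --len) {
            std::string prefix = src.substr(pos, len);
            auto h = handlers.find(prefix);
            if (h != handlers.end()) {
                pos += len;               // consume the trigger token
                return h->second(src, pos);
            }
            auto s = simple.find(prefix);
            if (s != simple.end()) {
                pos += len;
                return s->second;
            }
        }
        return fallback(src, pos);
    }
};
```

A handler registered under "/*" scans to the closing "*/" and returns a comment token; the fallback consumes identifier characters. The simple table settles "=" vs "==" by maximum munch.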
> If what we need at this point is a conventional lexer for the D
> language, std.d.lexer is the ticket. But I think it wouldn't be
> difficult to push our ambitions way beyond that. What say you?
>
Yup, I do agree.
More information about the Digitalmars-d
mailing list