std.d.lexer : voting thread

deadalnix deadalnix at gmail.com
Fri Oct 4 21:24:28 PDT 2013


On Saturday, 5 October 2013 at 00:24:22 UTC, Andrei Alexandrescu 
wrote:
> Vision
> ======
>
> I'd been following the related discussions for a while, but I 
> made up my mind today as I was working on a C++ lexer. The 
> lexer is for Facebook's internal linter; I'm translating it 
> from C++.
>
> Before long I realized two simple things. First, I can't reuse 
> anything from Brian's code (without copying it and doing 
> surgery on it), although it is extremely similar to what I'm 
> doing.
>
> Second, I figured that it is almost trivial to implement a 
> simple, generic, and reusable (across languages and tasks) 
> static trie searcher that takes a compile-time array with all 
> tokens and keywords and returns the token at the front of a 
> range with minimum comparisons.
>
> Such a trie searcher is not intelligent, but is very composable 
> and extremely fast. It is just smart enough to do maximum munch 
> (e.g. interprets "==" and "foreach" as one token each, not 
> two), but is not smart enough to distinguish an identifier 
> "whileTrue" from the keyword "while" (it claims "while" was 
> found and stops right at the beginning of "True" in the 
> stream). This is for generality so applications can define how 
> identifiers work (e.g. Lisp allows "-" in identifiers but D 
> doesn't, etc.). The trie finder doesn't do numbers or comments 
> either. No regexen of any kind.
>
> The beauty of it all is that all of these more involved bits 
> (many of which are language specific) can be implemented 
> modularly and trivially as a postprocessing step after the trie 
> finder. For example the user specifies "/*" as a token to the 
> trie finder. Whenever a comment starts, the trie finder will 
> find and return it; then the user implements the alternate 
> grammar of multiline comments.
>

That is more or less how SDC's lexer works. You pass it three 
things: two AAs -- one mapping strings to token types, and one 
mapping strings to the names of functions that return the actual 
token (for instance to handle /*) -- and finally a fallback for 
when nothing matches.

A giant three-headed monster of a mixin is generated from this 
data.

That has been really handy so far.

> If what we need at this point is a conventional lexer for the D 
> language, std.d.lexer is the ticket. But I think it wouldn't be 
> difficult to push our ambitions way beyond that. What say you?
>

Yup, I do agree.


More information about the Digitalmars-d mailing list