std.d.lexer : voting thread
Joseph Rushton Wakeling
joseph.wakeling at webdrake.net
Sun Oct 6 05:40:47 PDT 2013
On Saturday, 5 October 2013 at 00:24:22 UTC, Andrei Alexandrescu
wrote:
> On 10/2/13 7:41 AM, Dicebot wrote:
>> After brief discussion with Brian and gathering data from the
>> review
>> thread, I have decided to start voting for `std.d.lexer`
>> inclusion into
>> Phobos.
>
> Thanks all involved for the work, first of all Brian.
>
> I have the proverbial good news and bad news. The only bad news
> is that I'm voting "no" on this proposal.
>
> But there's plenty of good news.
>
> 1. I am not attempting to veto this, so just consider it a
> normal vote when tallying.
>
> 2. I do vote for inclusion in the /etc/ package for the time
> being.
>
> 3. The work is good and the code valuable, so even if my
> suggestions (below) are followed, virtually all of the code
> that gets the actual work done can be reused.
>
> Vision
> ======
>
> I'd been following the related discussions for a while, but I
> made up my mind today as I was working on a C++ lexer. The C++
> lexer is for Facebook's internal linter; I'm translating the
> lexer from C++ to D.
>
> Before long I realized two simple things. First, I can't reuse
> anything from Brian's code (without copying it and doing
> surgery on it), although it is extremely similar to what I'm
> doing.
>
> Second, I figured that it is almost trivial to implement a
> simple, generic, and reusable (across languages and tasks)
> static trie searcher that takes a compile-time array with all
> tokens and keywords and returns the token at the front of a
> range with minimum comparisons.
>
> Such a trie searcher is not intelligent, but is very composable
> and extremely fast. It is just smart enough to do maximum munch
> (e.g. interprets "==" and "foreach" as one token each, not
> two), but is not smart enough to distinguish an identifier
> "whileTrue" from the keyword "while" (it claims "while" was
> found and stops right at the beginning of "True" in the
> stream). This is for generality, so applications can define how
> identifiers work (e.g. Lisp allows "-" in identifiers but D
> doesn't, etc.). The trie finder doesn't handle numbers or
> comments either. No regexen of any kind.
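A minimal sketch of the maximum-munch behavior described above (the `munch` helper and its token list are illustrative only, not the proposed API, and use a runtime loop rather than generated code):

```d
// Illustrative maximum-munch matcher (not the proposed API).
import std.algorithm.searching : startsWith;

enum tokens = ["=", "==", "while", "foreach"];

/// Returns the longest token that prefixes `input`, or null if none.
string munch(string input)
{
    string best = null;
    foreach (t; tokens)
        if (input.startsWith(t) && (best is null || t.length > best.length))
            best = t;
    return best;
}

unittest
{
    assert(munch("==x") == "==");          // one token, not two "="
    assert(munch("whileTrue") == "while"); // stops at "True"; caller decides
}
```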
>
> The beauty of it all is that all of these more involved bits
> (many of which are language specific) can be implemented
> modularly and trivially as a postprocessing step after the trie
> finder. For example the user specifies "/*" as a token to the
> trie finder. Whenever a comment starts, the trie finder will
> find and return it; then the user implements the alternate
> grammar of multiline comments.
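A sketch of such a postprocessing step, assuming the trie finder has just returned the "/*" token (`skipBlockComment` is a hypothetical helper, not part of the proposal):

```d
// Hypothetical postprocessing helper for "/*" block comments.
import std.string : indexOf;

/// `rest` starts just past the "/*" returned by the trie finder.
/// Returns the input with the comment consumed, or null if unterminated.
string skipBlockComment(string rest)
{
    auto end = rest.indexOf("*/");
    return end < 0 ? null : rest[end + 2 .. $];
}

unittest
{
    assert(skipBlockComment(" a comment */rest") == "rest");
    assert(skipBlockComment("never closed") is null);
}
```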
>
> To encode the tokens returned by the trie, we must do away with
> definitions such as
>
> enum TokenType : ushort { invalid, assign, ... }
>
> These are fine for a tokenizer written in C, but are needless
> duplication from a D perspective. I think a better approach is:
>
> struct TokenType {
>     string symbol;
>     ...
> }
>
> TokenType tok(string s)() {
>     static immutable string interned = s;
>     return TokenType(interned);
> }
>
> Instead of associating token types with small integers, we
> associate them with string addresses. (For efficiency we may
> use pointers to zero-terminated strings, but I don't think
> that's necessary). Token types are interned by design, i.e. to
> compare two tokens for equality it suffices to compare the
> strings with "is" (this can be extended to general identifiers,
> not only statically-known tokens). Then, each token type has a
> natural representation that doesn't require the user to
> remember the name of the token. The left shift token is simply
> tok!"<<" and is application-global.
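The interning claim can be checked directly: each `tok!` instantiation owns a single `static immutable` string, so two calls with the same symbol return the same slice and `is` suffices (a small self-contained demo of the mechanism, restated from the snippet above):

```d
struct TokenType { string symbol; }

TokenType tok(string s)()
{
    // One static immutable per instantiation: one address per symbol.
    static immutable string interned = s;
    return TokenType(interned);
}

unittest
{
    auto a = tok!"<<";
    auto b = tok!"<<";
    assert(a.symbol is b.symbol);        // identity comparison suffices
    assert(a.symbol !is tok!"<".symbol); // different symbol, different slice
}
```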
>
> The static trie finder does not even build a trie - it simply
> generates a bunch of switch statements. The signature I've used
> is:
>
> Tuple!(size_t, size_t, Token)
> staticTrieFinder(alias TokenTable, R)(R r) {
>
> It returns a tuple with (a) whitespace characters before token,
> (b) newlines before token, and (c) the token itself, returned
> as tok!"whatever". To use for C++:
>
> alias CppTokenTable = TypeTuple!(
> "~", "(", ")", "[", "]", "{", "}", ";", ",", "?",
> "<", "<<", "<<=", "<=", ">", ">>", ">>=", "%", "%=", "=",
> "==", "!", "!=",
> "^", "^=", "*", "*=",
> ":", "::", "+", "++", "+=", "&", "&&", "&=", "|", "||", "|=",
> "-", "--", "-=", "->", "->*",
> "/", "/=", "//", "/*",
> "\\",
> ".",
> "'",
> "\"",
> "#", "##",
> "and",
> "and_eq",
> "asm",
> "auto",
> ...
> );
>
> Then the code uses staticTrieFinder!([CppTokenTable])(range).
> Of course, it's also possible to define the table itself as an
> array. I'm exploring right now in search of the most
> advantageous choices.
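To make "generates a bunch of switch statements" concrete, here is a hand-written sketch of the kind of nested-switch code such a generator might emit for the tokens "<", "<<", and "<<=" (illustrative only; not the actual generator's output):

```d
// Hand-written sketch of generated nested-switch matching for
// "<", "<<", "<<=" (illustrative; not the actual generator output).
string matchAngle(string s)
{
    switch (s.length ? s[0] : '\0')
    {
        case '<':
            switch (s.length > 1 ? s[1] : '\0')
            {
                case '<':
                    return (s.length > 2 && s[2] == '=') ? "<<=" : "<<";
                default:
                    return "<";
            }
        default:
            return null;
    }
}

unittest
{
    assert(matchAngle("<<=x") == "<<=");
    assert(matchAngle("<< y") == "<<");
    assert(matchAngle("< z") == "<");
    assert(matchAngle("x") is null);
}
```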
>
> I think the above would be a true lexer in the D spirit:
>
> - exploits D's string templates to essentially define
> non-alphanumeric symbols that are easy to use and understand,
> not confined to predefined tables (that enum!) and cheap to
> compare;
>
> - exploits D's code generation abilities to generate really
> fast code using inlined trie searching;
>
> - offers an API that is generic, flexible, and infinitely
> reusable.
>
> If what we need at this point is a conventional lexer for the D
> language, std.d.lexer is the ticket. But I think it wouldn't be
> difficult to push our ambitions way beyond that. What say you?
How quickly do you think this vision could be realized? If soon,
I'd say it's worth delaying a decision on the current proposed
lexer, if not ... well, jam tomorrow, perfect is the enemy of
good, and all that ...
More information about the Digitalmars-d
mailing list