std.d.lexer : voting thread

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Fri Oct 4 17:24:22 PDT 2013


On 10/2/13 7:41 AM, Dicebot wrote:
> After brief discussion with Brian and gathering data from the review
> thread, I have decided to start voting for `std.d.lexer` inclusion into
> Phobos.

Thanks all involved for the work, first of all Brian.

I have the proverbial good news and bad news. The only bad news is that 
I'm voting "no" on this proposal.

But there's plenty of good news.

1. I am not attempting to veto this, so just consider it a normal vote 
when tallying.

2. I do vote for inclusion in the /etc/ package for the time being.

3. The work is good and the code valuable, so even in the case my 
suggestions (below) are followed, virtually all of the code that gets 
the work done can be reused.

Vision
======

I'd been following the related discussions for a while, but I made up 
my mind today as I was working on a C++ lexer. The C++ lexer is for 
Facebook's internal linter; I'm translating it from C++.

Before long I realized two simple things. First, I can't reuse anything 
from Brian's code (without copying it and doing surgery on it), although 
it is extremely similar to what I'm doing.

Second, I figured that it is almost trivial to implement a simple, 
generic, and reusable (across languages and tasks) static trie searcher 
that takes a compile-time array with all tokens and keywords and returns 
the token at the front of a range with minimum comparisons.

Such a trie searcher is not intelligent, but is very composable and 
extremely fast. It is just smart enough to do maximum munch (e.g. 
interprets "==" and "foreach" as one token each, not two), but is not 
smart enough to distinguish an identifier "whileTrue" from the keyword 
"while" (it claims "while" was found and stops right at the beginning of 
"True" in the stream). This is for generality so applications can define 
how identifiers work (e.g. Lisp allows "-" in identifiers but D doesn't 
etc). The trie finder doesn't do numbers or comments either. No regexen 
of any kind.

The beauty of it all is that all of these more involved bits (many of 
which are language specific) can be implemented modularly and trivially 
as a postprocessing step after the trie finder. For example the user 
specifies "/*" as a token to the trie finder. Whenever a comment starts, 
the trie finder will find and return it; then the user implements the 
alternate grammar of multiline comments.
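As a minimal sketch of that postprocessing step (the name `lexBlockComment` is hypothetical, not part of std.d.lexer or the trie finder): once the trie finder has consumed "/*", the application takes over and consumes the comment's alternate grammar up to "*/".

```d
import std.range.primitives : empty, front, popFront;

/// Hypothetical postprocessing step: the trie finder has already
/// consumed "/*"; this consumes the comment body up to and
/// including the closing "*/", returning the body text.
string lexBlockComment(R)(ref R r)
{
    string text;
    while (!r.empty)
    {
        immutable c = r.front;
        r.popFront();
        if (c == '*' && !r.empty && r.front == '/')
        {
            r.popFront(); // consume the closing '/'
            return text;  // body without the delimiters
        }
        text ~= c;
    }
    return text; // unterminated comment; real code would diagnose this
}

void main()
{
    auto src = "a comment */rest";
    assert(lexBlockComment(src) == "a comment ");
    assert(src == "rest"); // the range resumes right after "*/"
}
```

The same pattern applies to the other involved bits (nested comments, string literals, numbers): the trie finder hands over at the opening token and the language-specific routine finishes the job.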

To encode the tokens returned by the trie, we must do away with 
definitions such as

enum TokenType : ushort { invalid, assign, ... }

These are fine for a tokenizer written in C, but are needless 
duplication from a D perspective. I think a better approach is:

struct TokenType {
   string symbol;
   ...
}

TokenType tok(string s)() {
   static immutable string interned = s;
   return TokenType(interned);
}

Instead of associating token types with small integers, we associate 
them with string addresses. (For efficiency we may use pointers to 
zero-terminated strings, but I don't think that's necessary). Token 
types are interned by design, i.e. to compare two tokens for equality it 
suffices to compare the strings with "is" (this can be extended to 
general identifiers, not only statically-known tokens). Then, each token 
type has a natural representation that doesn't require the user to 
remember the name of the token. The left shift token is simply tok!"<<" 
and is application-global.
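For concreteness, here is a self-contained sketch of the interning scheme just described (only the `symbol` field is filled in; the elided members are omitted):

```d
struct TokenType
{
    string symbol; // interned by design; identity comparison suffices
}

/// Each template instantiation gets its own static immutable copy,
/// so every call of tok!"<<" returns a TokenType holding the same
/// string address, application-global.
TokenType tok(string s)()
{
    static immutable string interned = s;
    return TokenType(interned);
}

void main()
{
    // Equality of token types reduces to an "is" comparison:
    assert(tok!"<<"().symbol is tok!"<<"().symbol);
    assert(tok!"<<"().symbol !is tok!">>"().symbol);
}
```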

The static trie finder does not even build a trie - it simply generates 
a bunch of switch statements. The signature I've used is:

Tuple!(size_t, size_t, Token)
staticTrieFinder(alias TokenTable, R)(R r) {

It returns a tuple with (a) whitespace characters before token, (b) 
newlines before token, and (c) the token itself, returned as 
tok!"whatever". To use for C++:

alias CppTokenTable = TypeTuple!(
   "~", "(", ")", "[", "]", "{", "}", ";", ",", "?",
   "<", "<<", "<<=", "<=", ">", ">>", ">>=", "%", "%=", "=", "==", "!", 
"!=",
   "^", "^=", "*", "*=",
   ":", "::", "+", "++", "+=", "&", "&&", "&=", "|", "||", "|=",
   "-", "--", "-=", "->", "->*",
   "/", "/=", "//", "/*",
   "\\",
   ".",
   "'",
   "\"",
   "#", "##",
   "and",
   "and_eq",
   "asm",
   "auto",
   ...
);

Then the code uses staticTrieFinder!([CppTokenTable])(range). Of course, 
it's also possible to define the table itself as an array. I'm exploring 
right now in search of the most advantageous choices.
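To illustrate the matching behavior (a sketch only: `matchToken` is a hypothetical name, and a compile-time-unrolled linear scan stands in for the generated switch statements, which would be faster but observably equivalent):

```d
import std.algorithm.searching : startsWith;
import std.meta : AliasSeq;

/// Returns the longest table entry that prefixes `input` (maximum
/// munch), or the empty string if none matches. The table is a
/// compile-time tuple, so the loop unrolls at compile time.
string matchToken(table...)(string input)
{
    string best;
    static foreach (tok; table)
    {
        if (input.startsWith(tok) && tok.length > best.length)
            best = tok;
    }
    return best;
}

void main()
{
    alias MiniTable = AliasSeq!("<", "<<", "<<=", "=", "==", "while");

    assert(matchToken!MiniTable("<<= rest") == "<<=");  // maximum munch
    assert(matchToken!MiniTable("== x") == "==");
    // Deliberately not smart about identifiers: it stops right
    // after "while" in "whileTrue" and leaves the rest to the app.
    assert(matchToken!MiniTable("whileTrue") == "while");
    assert(matchToken!MiniTable("@") == "");            // no match
}
```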

I think the above would be a true lexer in the D spirit:

- exploits D's string templates to essentially define non-alphanumeric 
symbols that are easy to use and understand, not confined to predefined 
tables (that enum!) and cheap to compare;

- exploits D's code generation abilities to generate really fast code 
using inlined trie searching;

- offers an API that is generic, flexible, and infinitely reusable.

If what we need at this point is a conventional lexer for the D 
language, std.d.lexer is the ticket. But I think it wouldn't be 
difficult to push our ambitions way beyond that. What say you?


Andrei
