std.d.lexer : voting thread
Andrei Alexandrescu
SeeWebsiteForEmail at erdani.org
Fri Oct 4 17:24:22 PDT 2013
On 10/2/13 7:41 AM, Dicebot wrote:
> After brief discussion with Brian and gathering data from the review
> thread, I have decided to start voting for `std.d.lexer` inclusion into
> Phobos.
Thanks all involved for the work, first of all Brian.
I have the proverbial good news and bad news. The only bad news is that
I'm voting "no" on this proposal.
But there's plenty of good news.
1. I am not attempting to veto this, so just consider it a normal vote
when tallying.
2. I do vote for inclusion in the /etc/ package for the time being.
3. The work is good and the code valuable, so even if my
suggestions (below) are followed, virtually all of the code that
does the actual work can be reused.
Vision
======
I'd been following the related discussions for a while, but I made
up my mind today while working on a C++ lexer for Facebook's internal
linter. I'm translating that lexer from C++ to D.
Before long I realized two simple things. First, I can't reuse anything
from Brian's code (without copying it and doing surgery on it), although
it is extremely similar to what I'm doing.
Second, I figured that it is almost trivial to implement a simple,
generic, and reusable (across languages and tasks) static trie searcher
that takes a compile-time array with all tokens and keywords and returns
the token at the front of a range with minimum comparisons.
Such a trie searcher is not intelligent, but is very composable and
extremely fast. It is just smart enough to do maximum munch (e.g.
interprets "==" and "foreach" as one token each, not two), but is not
smart enough to distinguish an identifier "whileTrue" from the keyword
"while" (it claims "while" was found and stops right at the beginning of
"True" in the stream). This is for generality so applications can define
how identifiers work (e.g. Lisp allows "-" in identifiers but D doesn't
etc). The trie finder doesn't do numbers or comments either. No regexen
of any kind.
The beauty of it all is that all of these more involved bits (many of
which are language specific) can be implemented modularly and trivially
as a postprocessing step after the trie finder. For example the user
specifies "/*" as a token to the trie finder. Whenever a comment starts,
the trie finder will find and return it; then the user implements the
alternate grammar of multiline comments.
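A minimal sketch of that postprocessing idea (the name lexBlockComment
and the exact interface are hypothetical; the trie finder itself is not
shown): once the finder reports a "/*" token, the caller consumes the
comment body with ordinary range code:

    import std.algorithm.searching : findSplitAfter;

    // Hypothetical postprocessing step: the trie finder has just
    // matched "/*". Consume the rest of the multiline comment and
    // return its text, advancing the input past it.
    string lexBlockComment(ref string input)
    {
        auto parts = input.findSplitAfter("*/");
        if (!parts) throw new Exception("unterminated block comment");
        auto comment = parts[0]; // everything up to and including "*/"
        input = parts[1];        // rest of the source
        return comment;
    }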
To encode the tokens returned by the trie, we must do away with
definitions such as
    enum TokenType : ushort { invalid, assign, ... }
These are fine for a tokenizer written in C, but are needless
duplication from a D perspective. I think a better approach is:
    struct TokenType {
        string symbol;
        ...
    }

    TokenType tok(string s)() {
        static immutable string interned = s;
        return TokenType(interned);
    }
Instead of associating token types with small integers, we associate
them with string addresses. (For efficiency we may use pointers to
zero-terminated strings, but I don't think that's necessary). Token
types are interned by design, i.e. to compare two tokens for equality it
suffices to compare the strings with "is" (this can be extended to
general identifiers, not only statically-known tokens). Then, each token
type has a natural representation that doesn't require the user to
remember the name of the token. The left shift token is simply tok!"<<"
and is application-global.
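A quick sketch of what that buys (built directly on the tok definition
above): two instantiations with the same string literal share one
static immutable, so identity comparison is enough:

    // Each distinct instantiation tok!"..." owns one static immutable
    // string, so token types compare by address, never by content.
    auto a = tok!"<<";
    auto b = tok!"<<";
    assert(a.symbol is b.symbol);               // same interned address
    assert(tok!"<<".symbol !is tok!"<".symbol); // different tokens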
The static trie finder does not even build a trie - it simply generates
a bunch of switch statements. The signature I've used is:
    Tuple!(size_t, size_t, Token)
    staticTrieFinder(alias TokenTable, R)(R r);

It returns a tuple with (a) the number of whitespace characters before
the token, (b) the number of newlines before the token, and (c) the
token itself, returned as tok!"whatever". To use it for C++:
alias CppTokenTable = TypeTuple!(
"~", "(", ")", "[", "]", "{", "}", ";", ",", "?",
"<", "<<", "<<=", "<=", ">", ">>", ">>=", "%", "%=", "=", "==", "!",
"!=",
"^", "^=", "*", "*=",
":", "::", "+", "++", "+=", "&", "&&", "&=", "|", "||", "|=",
"-", "--", "-=", "->", "->*",
"/", "/=", "//", "/*",
"\\",
".",
"'",
"\"",
"#", "##",
"and",
"and_eq",
"asm",
"auto",
...
);
Then the code uses staticTrieFinder!([CppTokenTable])(range). Of course,
it's also possible to define the table itself as an array. I'm exploring
right now in search of the most advantageous choices.
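To make the "bunch of switch statements" concrete, here is a
hand-written stand-in (hypothetical; the real code would be generated
from the token table) for what the finder might emit for just the three
tokens "<", "<<", and "<<=": nested switches in which the longest match
wins, i.e. maximum munch:

    // Returns the longest token matched at the front of r, or null.
    string matchAngle(string r)
    {
        if (r.length == 0) return null;
        switch (r[0])
        {
        case '<':
            if (r.length > 1) switch (r[1])
            {
            case '<':
                if (r.length > 2 && r[2] == '=')
                    return r[0 .. 3]; // "<<="
                return r[0 .. 2];     // "<<"
            default:
                break;
            }
            return r[0 .. 1]; // "<"
        default:
            return null;
        }
    }

Given "<<=x", this returns "<<=" rather than "<" or "<<", which is
exactly the maximum-munch behavior described above.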
I think the above would be a true lexer in the D spirit:
- exploits D's string templates to essentially define non-alphanumeric
symbols that are easy to use and understand, not confined to predefined
tables (that enum!) and cheap to compare;
- exploits D's code generation abilities to generate really fast code
using inlined trie searching;
- offers an API that is generic, flexible, and infinitely reusable.
If what we need at this point is a conventional lexer for the D
language, std.d.lexer is the ticket. But I think it wouldn't be
difficult to push our ambitions way beyond that. What say you?
Andrei
More information about the Digitalmars-d
mailing list