std.d.lexer: pre-voting review / discussion

H. S. Teoh hsteoh at quickfur.ath.cx
Wed Sep 11 12:55:18 PDT 2013


On Wed, Sep 11, 2013 at 11:49:44AM -0700, Walter Bright wrote:
> On 9/11/2013 8:01 AM, Dicebot wrote:
> >std.d.lexer is standard module for lexing D code, written by Brian
> >Schott
> 
> Thank you, Brian! This is important work.
> 
> Not a thorough review, just some notes from reading the doc file:
> 
> 1. I don't like the _ suffix for keywords. Just call it kwimport or
> something like that.

I agree.


> 2. The example uses an if-then sequence of isBuiltType, isKeyword,
> etc. Should be an enum so a switch can be done for speed.

I believe this is probably a result of having a separate enum value for
each token, while at the same time needing to group them into categories
for syntax highlighting purposes. I'd suggest a function for converting
the TokenType enum value into a TokenCategory enum. Perhaps something
like:

	enum TokenCategory { BuiltInType, Keyword, ... }

	// Convert TokenType into TokenCategory
	TokenCategory category(TokenType tt) { ... }

Then in user code, you call category() on the token type and switch
over that, which the compiler can turn into a jump table instead of a
chain of comparisons.
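
For example, a syntax highlighter could then be written as a single
switch (writeSpan here is a made-up output helper, and token.type /
token.value are assumed field names):

	switch (category(token.type)) {
		case TokenCategory.BuiltInType:
			writeSpan("type", token.value);
			break;
		case TokenCategory.Keyword:
			writeSpan("keyword", token.value);
			break;
		default:
			writeSpan("plain", token.value);
			break;
	}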

Implementation-wise, I'd suggest either a hash table on TokenType, or
perhaps even encoding the category into the TokenType enum values, for
example:

	enum TokenCategory {
		BuiltInType, Keyword, ...
	}

	enum TokenType {
		IntToken = (TokenCategory.BuiltInType << 16) | 0x0001,
		FloatToken = (TokenCategory.BuiltInType << 16) | 0x0002,
		...
		FuncToken = (TokenCategory.Keyword << 16) | 0x0001,
	}

Then the category function can be simply:

	TokenCategory category(TokenType tt) {
		return cast(TokenCategory)(tt >> 16);
	}

Admittedly, this is a bit hackish, but if you're going for speed, it
reduces the category lookup to a single shift.
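
As a bonus, the mapping can be sanity-checked at compile time:

	// These should hold, given the definitions above.
	static assert(category(TokenType.IntToken) == TokenCategory.BuiltInType);
	static assert(category(TokenType.FuncToken) == TokenCategory.Keyword);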


> 3. I assumed TokenType is a type. But it's not, it's an enum. Even
> the document says it's a 'type', but it's not a type.

I don't understand this complaint. I thought it was pretty clear that it
refers to what type of token it is (or would you prefer TokenKind?).


> 4. When naming tokens like .. 'slice', it is giving it a
> syntactic/semantic name rather than a token name. This would be
> awkward if .. took on new meanings in D. Calling it 'dotdot' would
> be clearer. Ditto for the rest. For example that is done better, '*'
> is called 'star', rather than 'dereference'.

I agree. Especially since '*' can also mean multiplication, depending on
context. It would be weird (and unnecessarily obscure) for parser code
to refer to 'dereference' when parsing expressions. :)


> 5. The LexerConfig initialization should be a constructor rather than
> a sequence of assignments. LexerConfig documentation is awfully thin.
> For example, 'tokenStyle' is explained as being 'Token style',
> whatever that is.

I'm on the fence about this one. Setting up the configuration before
starting the lexing process makes sense to me.

Perhaps one way to improve this is to rename LexerConfig to DLexer, and
make byToken a member function (or call it via UFCS):

	DLexer lex;
	lex.iterStyle = ...;
	// etc.

	foreach (token; lex.byToken()) {
		...
	}

This reads better: you're getting a list of tokens from the lexer 'lex',
as opposed to getting something from byToken(config), which doesn't
really *say* what it's doing: is it tokenizing the config object?


> 6. No clue how lookahead works with this. Parsing D requires arbitrary
> lookahead.

What has this got to do with lexing? The parser needs to keep track of
its state independently of the lexer anyway, doesn't it?  I'm not sure
how DMD's parser works, but the way I usually write parsers is that
their state encodes the series of tokens encountered so far, so they
don't need the lexer to save any tokens for them. If they need to refer
to tokens, they create partial AST trees on the parser stack that
reference said tokens. I don't see why it should be the lexer's job to
keep track of such things.
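
To illustrate: even if byToken only provides a plain input range of
tokens, the parser can implement arbitrary lookahead itself with a
small buffering wrapper. A minimal sketch (Lookahead, peek, and next
are made-up names, and error handling is elided):

	struct Lookahead(R) {
		R source;        // token range from byToken
		Token[] buffer;  // tokens read ahead but not yet consumed

		// Look at the n-th upcoming token without consuming it.
		Token peek(size_t n = 0) {
			while (buffer.length <= n) {
				buffer ~= source.front;
				source.popFront();
			}
			return buffer[n];
		}

		// Consume and return the next token.
		Token next() {
			auto t = peek(0);
			buffer = buffer[1 .. $];
			return t;
		}
	}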


[...]
> 8. 'default_' Again with the awful practice of appending _.

Yeah, I didn't like that. Prefixing keywords with "kw" or "kw_" is much
better than the current confusing practice of appending _ to some token
types but not others.


> 9. Need to insert intra-page navigation links, such as when
> 'byToken()' appears in the text, it should be link to where byToken
> is described.

Agreed.


[...]
> I believe the state of the documentation is a showstopper, and needs
> to be extensively fleshed out before it can be considered ready for
> voting.

I think the API can be improved. The LexerConfig -> DLexer rename is an
important readability issue IMO.

Also, it's unclear what types of input are supported -- the code example
only uses string input, but what about file input? Does it support
byLine? Or does it need me to slurp the entire file contents into an
in-memory buffer first?
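
For instance, it would be nice if something like this worked, instead
of having to read the whole file into memory first (purely speculative;
the docs don't say what byToken accepts):

	import std.algorithm : joiner;
	import std.stdio : File;

	LexerConfig config;
	// Feed the file lazily, line by line; joiner("\n") restores the
	// newlines that byLine strips.
	auto tokens = File("example.d").byLine.joiner("\n").byToken(config);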

Now, somebody pointed out that there's currently no way to tell it that
the data you're feeding to it is a partial slice of the full source. I
think this should be an easy fix: LexerConfig (or DLexer after the
rename) can have a field for specifying the initial line number / column
number, defaulting to (1, 1) but changeable by the caller when lexing
code fragments.
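
Something along these lines (field names are hypothetical):

	LexerConfig config;
	config.startLine = 27;   // hypothetical: fragment starts at line 27
	config.startColumn = 5;  // ... and at column 5
	auto tokens = fragment.byToken(config);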

A minor concern I have about the current implementation (I haven't
looked at the code yet; this is based solely on the documentation) is
that there's no way to choose what kind of data you want in the token
stream. Right now, it appears that it always computes startIndex, line,
and column. What if I don't care about this information and only want,
say, the token type and value (e.g., I'm pretty-printing the source, so
I don't care how the original was formatted)? Would it be possible to
skip the extra computation needed to populate these fields?
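
One way to expose that choice would be a set of flags in the config
(again, purely hypothetical names):

	enum TokenField {
		type       = 1 << 0,
		value      = 1 << 1,
		line       = 1 << 2,
		column     = 1 << 3,
		startIndex = 1 << 4,
	}

	LexerConfig config;
	// Pretty-printer: skip the position bookkeeping entirely.
	config.fields = TokenField.type | TokenField.value;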


T

-- 
MACINTOSH: Most Applications Crash, If Not, The Operating System Hangs
