std.d.lexer: pre-voting review / discussion

Brian Schott briancschott at gmail.com
Wed Sep 11 13:37:34 PDT 2013


On Wednesday, 11 September 2013 at 19:56:45 UTC, H. S. Teoh wrote:
>> 2. The example uses an if-then sequence of isBuiltType, isKeyword,
>> etc. Should be an enum so a switch can be done for speed.
>
> I believe this is probably a result of having a separate enum value
> for each token, and at the same time needing to group them into
> categories for syntax highlighting purposes. I'd suggest a function
> for converting the TokenType enum value into a TokenCategory enum.
> Perhaps something like:
>
> 	enum TokenCategory { BuiltInType, Keyword, ... }
>
> 	// Convert TokenType into TokenCategory
> 	TokenCategory category(TokenType tt) { ... }
>
> Then in user code, you call category() on the token type and switch
> over that. This maximizes performance.
>
> Implementation-wise, I'd suggest either a hash table on TokenType,
> or perhaps even encoding the category into the TokenType enum
> values, for example:
>
> 	enum TokenCategory {
> 		BuiltInType, Keyword, ...
> 	}
>
> 	enum TokenType {
> 		IntToken = (TokenCategory.BuiltInType << 16) | 0x0001,
> 		FloatToken = (TokenCategory.BuiltInType << 16) | 0x0002,
> 		...
> 		FuncToken = (TokenCategory.Keyword << 16) | 0x0001,
> 	}
>
> Then the category function can be simply:
>
> 	TokenCategory category(TokenType tt) {
> 		return cast(TokenCategory)(tt >> 16);
> 	}
>
> Though admittedly, this is a bit hackish. But if you're going for
> speed, this would be quite fast.

There are already plenty of hackish things in that module, so 
this would fit right in.
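
For illustration, here is a rough sketch of how user code might
consume such a helper for syntax highlighting. TokenCategory,
category(), and the token fields used here (type, value) are just
the hypothetical names from the suggestion above, not part of the
module as it stands:

    // Sketch only: TokenCategory, category(), and the Token fields
    // (type, value) are assumed from the suggestion above.
    void highlight(Tokens)(Tokens tokens)
    {
        import std.stdio : write;

        foreach (t; tokens)
        {
            switch (category(t.type))
            {
                case TokenCategory.BuiltInType:
                    write(`<span class="type">`, t.value, `</span>`);
                    break;
                case TokenCategory.Keyword:
                    write(`<span class="kwrd">`, t.value, `</span>`);
                    break;
                default:
                    write(t.value);
                    break;
            }
        }
    }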

>> 4. When naming tokens like .. 'slice', it is giving it a
>> syntactic/semantic name rather than a token name. This would be
>> awkward if .. took on new meanings in D. Calling it 'dotdot' would
>> be clearer. Ditto for the rest. For an example of that done better:
>> '*' is called 'star', rather than 'dereference'.
>
> I agree. Especially since '*' can also mean multiplication, depending
> on context. It would be weird (and unnecessarily obscure) for parser
> code to refer to 'dereference' when parsing expressions. :)

If you read the docs/code you'll see that "*" is called "star" 
:-). The slice -> dotdot rename is pretty simple to do.

>> 5. The LexerConfig initialization should be a constructor rather
>> than a sequence of assignments. LexerConfig documentation is
>> awfully thin. For example, 'tokenStyle' is explained as being
>> 'Token style', whatever that is.
>
> I'm on the fence about this one. Setting up the configuration before
> starting the lexing process makes sense to me.
>
> Perhaps one way to improve this is to rename LexerConfig to DLexer,
> and make byToken a member function (or call it via UFCS):
>
> 	DLexer lex;
> 	lex.iterStyle = ...;
> 	// etc.
>
> 	foreach (token; lex.byToken()) {
> 		...
> 	}
>
> This reads better: you're getting a list of tokens from the lexer
> 'lex', as opposed to getting something from byToken(config), which
> doesn't really *say* what it's doing: is it tokenizing the config
> object?

byToken is a free function because its first parameter is the
range to tokenize. This allows you to use UFCS on the range
(e.g. "sourceCode.byToken()"). Putting it in a struct/class would
break this.
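
For example (a rough sketch against the API under review; the exact
Token field names may differ from what the docs show):

    import std.d.lexer;
    import std.stdio : writeln;

    void dump(string sourceCode)
    {
        LexerConfig config;                          // defaults are fine here
        foreach (token; sourceCode.byToken(config))  // UFCS: the range is the first argument
            writeln(token.value);                    // "value" assumed to be the token's text
    }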

>> 6. No clue how lookahead works with this. Parsing D requires
>> arbitrary lookahead.
>
> What has this got to do with lexing? The parser needs to keep track
> of its state independently of the lexer anyway, doesn't it? I'm not
> sure how DMD's parser works, but the way I usually write parsers is
> that their state encodes the series of tokens encountered so far, so
> they don't need the lexer to save any tokens for them. If they need
> to refer to tokens, they create partial AST trees on the parser
> stack that reference said tokens. I don't see why it should be the
> lexer's job to keep track of such things.

For parsing, you'll likely want to use array() to grab all the 
tokens. But there are other uses such as syntax highlighting that 
only need one token at a time.
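
Roughly like this (a sketch, with the same API caveats as above):

    import std.array : array;
    import std.d.lexer;

    void parseModule(string sourceCode)
    {
        LexerConfig config;
        // Eager: materialize every token up front so the parser gets
        // random access and arbitrary lookahead.
        auto tokens = sourceCode.byToken(config).array();
        // ... hand `tokens` to the parser ...
    }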

>> 9. Need to insert intra-page navigation links, such as when
>> 'byToken()' appears in the text, it should be a link to where
>> byToken is described.
>
> Agreed.

I'll work on this later this evening.

>
> [...]
>> I believe the state of the documentation is a showstopper, and
>> needs to be extensively fleshed out before it can be considered
>> ready for voting.
>
> I think the API can be improved. The LexerConfig -> DLexer rename is
> an important readability issue IMO.

I'm not convinced (yet?).

> Also, it's unclear what types of input are supported -- the code
> example only uses string input, but what about file input? Does it
> support byLine? Or does it need me to slurp the entire file contents
> into an in-memory buffer first?

The input is a range of bytes, which the lexer assumes is UTF-8. 
The lexer works much faster on arrays than it does on arbitrary 
ranges though. Dmitry Olshansky implemented a circular buffer in 
this module that's used to cache the input range internally. This 
buffer is then used for the lexing process.
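
So something like this should work (a sketch; reading the whole file
into an in-memory array takes the fast path, while a non-array range
goes through the circular buffer):

    import std.d.lexer;
    import std.file : read;

    void lexFile(string path)
    {
        LexerConfig config;
        // Whole file as an array of bytes, assumed to be UTF-8.
        auto bytes = cast(ubyte[]) read(path);
        foreach (token; bytes.byToken(config))
        {
            // ...
        }
    }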

> Now, somebody pointed out that there's currently no way to tell it
> that the data you're feeding to it is a partial slice of the full
> source. I think this should be an easy fix: LexerConfig (or DLexer
> after the rename) can have a field for specifying the initial line
> number / column number, which defaults to (1, 1) but can be changed
> by the caller for parsing code fragments.

That's simple enough to add.
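
Hypothetically, it could end up looking something like this
(startLine/startColumn are placeholder names for the suggested
fields; they don't exist in the module yet):

    // Hypothetical: startLine/startColumn are placeholders for the
    // suggested fields, not part of the current LexerConfig.
    auto lexFragment(string fragment)
    {
        import std.d.lexer;

        LexerConfig config;
        config.startLine   = 42; // fragment begins at line 42 of the real file
        config.startColumn = 9;
        return fragment.byToken(config);
    }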

> A minor concern I have about the current implementation (I didn't
> look at the code yet, but this is just based on the documentation)
> is that there's no way to choose what kind of data you want in the
> token stream. Right now, it appears that it always computes
> startIndex, line, and column. What if I don't care about this
> information, and only want, say, the token type and value (say I'm
> pretty-printing the source, so I don't care how the original was
> formatted)? Would it be possible to skip the additional computations
> required to populate these fields?

It's possible, but I fear that it would make the code a mess.

