std.d.lexer: pre-voting review / discussion

H. S. Teoh hsteoh at quickfur.ath.cx
Wed Sep 11 15:17:21 PDT 2013


On Wed, Sep 11, 2013 at 10:37:34PM +0200, Brian Schott wrote:
> On Wednesday, 11 September 2013 at 19:56:45 UTC, H. S. Teoh wrote:
> >>2. The example uses an if-then sequence of isBuiltType, isKeyword,
> >>etc. Should be an enum so a switch can be done for speed.
> >
> >I believe this is probably a result of having a separate enum value
> >of each token, and at the same time needing to group them into
> >categories for syntax highlighting purposes. I'd suggest a function
> >for converting the TokenType enum value into a TokenCategory enum.
> >Perhaps something like:
> >
> >	enum TokenCategory { BuiltInType, Keyword, ... }
> >
> >	// Convert TokenType into TokenCategory
> >	TokenCategory category(TokenType tt) { ... }
> >
> >Then in user code, you call category() on the token type, and switch
> >over that. This maximizes performance.
> >
> >Implementation-wise, I'd suggest either a hash table on TokenType, or
> >perhaps even encoding the category into the TokenType enum values,
> >for example:
> >
> >	enum TokenCategory {
> >		BuiltInType, Keyword, ...
> >	}
> >
> >	enum TokenType {
> >		IntToken = (TokenCategory.BuiltInType << 16) | 0x0001,
> >		FloatToken = (TokenCategory.BuiltInType << 16) | 0x0002,
> >		...
> >		FuncToken = (TokenCategory.Keyword << 16) | 0x0001,
> >	}
> >
> >Then the category function can be simply:
> >
> >	TokenCategory category(TokenType tt) {
> >		return cast(TokenCategory)(tt >> 16);
> >	}
> >
> >Though admittedly, this is a bit hackish. But if you're going for
> >speed, this would be quite fast.
> 
> There are already plenty of hackish things in that module, so this
> would fit right in.

Better hope it passes code review, though. :)
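
To make the idea concrete, here's a self-contained sketch of the
packed-enum approach with a switch over category(). (Token and
category names here are placeholders, not the module's actual ones.)

	import std.stdio;

	enum TokenCategory { BuiltInType, Keyword, Operator }

	// Category packed into the high bits, as sketched above.
	enum TokenType {
		IntToken   = (TokenCategory.BuiltInType << 16) | 0x0001,
		FloatToken = (TokenCategory.BuiltInType << 16) | 0x0002,
		FuncToken  = (TokenCategory.Keyword     << 16) | 0x0001,
		StarToken  = (TokenCategory.Operator    << 16) | 0x0001,
	}

	TokenCategory category(TokenType tt) {
		return cast(TokenCategory)(tt >> 16);
	}

	void main() {
		// User code switches on the category, not the token type.
		final switch (category(TokenType.FuncToken)) {
			case TokenCategory.BuiltInType: writeln("type");    break;
			case TokenCategory.Keyword:     writeln("keyword"); break;
			case TokenCategory.Operator:    writeln("op");      break;
		}
	}

A nice bonus of final switch here is that the compiler complains if a
category is ever added without the user code being updated.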


> >>4. Naming a token like '..' 'slice' gives it a syntactic/semantic
> >>name rather than a token name. This would be awkward if '..' took on
> >>new meanings in D. Calling it 'dotdot' would be clearer. Ditto for
> >>the rest. As an example of this done better, '*' is called 'star'
> >>rather than 'dereference'.
> >
> >I agree. Especially since '*' can also mean multiplication, depending
> >on context. It would be weird (and unnecessarily obscure) for parser
> >code to refer to 'dereference' when parsing expressions. :)
> 
> If you read the docs/code you'll see that "*" is called "star" :-).
> The slice -> dotdot rename is pretty simple to do.

I was just going by what Walter said, but OK, fair enough.


> >>5. The LexerConfig initialization should be a constructor rather
> >>than a sequence of assignments. LexerConfig documentation is awfully
> >>thin.  For example, 'tokenStyle' is explained as being 'Token
> >>style', whatever that is.
> >
> >I'm on the fence about this one. Setting up the configuration before
> >starting the lexing process makes sense to me.
> >
> >Perhaps one way to improve this is to rename LexerConfig to DLexer,
> >and make byToken a member function (or call it via UFCS):
> >
> >	DLexer lex;
> >	lex.iterStyle = ...;
> >	// etc.
> >
> >	foreach (token; lex.byToken()) {
> >		...
> >	}
> >
> >This reads better: you're getting a list of tokens from the lexer
> >'lex', as opposed to getting something from byToken(config), which
> >doesn't really *say* what it's doing: is it tokenizing the config
> >object?
> 
> byToken is a free function because its first parameter is the range to
> tokenize. This allows you to use UFCS on the range (e.g.
> "sourceCode.byToken()"). Putting it in a struct/class would break this.

Very good point. In that case, would it be possible to make byToken
accept configuration parameters inline? That is:

	auto tokens = File(mysource)
		.byLine() // does this actually work??
		.byToken(IterationStyle.everything, ...);

This seems better suited for UFCS-style chains, as opposed to needing to
create a separate config object outside the chain just so you can use it
inside. And writing LexerConfig(...) inline seems a bit cumbersome:

	auto tokens = File(mysource)
		.byLine() // does this actually work??
		.byToken(LexerConfig(IterationStyle.everything, ...));

though, in retrospect, perhaps not as bad as it initially sounds.
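
For what it's worth, the inline form could just be a thin convenience
overload on top of the config-taking one. A sketch, with stand-in
definitions so it compiles by itself (only the byToken/LexerConfig
names come from the actual module; the IterationStyle members are
assumed):

	// Stand-ins; the real definitions live in std.d.lexer.
	enum IterationStyle { codeOnly, everything }
	struct LexerConfig { IterationStyle iterStyle; }
	auto byToken(R)(R range, LexerConfig config) { return range; }

	// Hypothetical convenience overload: take the settings inline and
	// build the LexerConfig internally, keeping UFCS chains flat.
	auto byToken(R)(R range, IterationStyle style) {
		return range.byToken(LexerConfig(style));
	}

	void main() {
		auto tokens = "int x;".byToken(IterationStyle.everything);
	}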


> >>6. No clue how lookahead works with this. Parsing D requires
> >>arbitrary lookahead.
> >
> >What has this got to do with lexing? The parser needs to keep track
> >of its state independently of the lexer anyway, doesn't it?  I'm not
> >sure how DMD's parser works, but the way I usually write parsers is
> >that their state encodes the series of tokens encountered so far, so
> >they don't need the lexer to save any tokens for them. If they need
> >to refer to tokens, they create partial AST trees on the parser stack
> >that reference said tokens. I don't see why it should be the lexer's
> >job to keep track of such things.
> 
> For parsing, you'll likely want to use array() to grab all the
> tokens. But there are other uses such as syntax highlighting that
> only need one token at a time.

I don't think that's necessary. An appropriately-constructed parser
doesn't need arbitrary slicing of tokens; its internal state should
encode what tokens have been seen thus far so that it never needs to
backtrack.
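
(For reference, Brian's array() route is a one-liner; this sketch
assumes std.d.lexer's byToken as documented:)

	import std.array : array;
	import std.d.lexer; // the module under review

	void example(string sourceCode, LexerConfig config) {
		// Slurp the lazy token range up front; tokens[i + k] then
		// gives k tokens of lookahead from position i, with no
		// buffering logic in the parser itself.
		auto tokens = sourceCode.byToken(config).array();
	}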


[...]
> >I think the API can be improved. The LexerConfig -> DLexer rename is
> >an important readability issue IMO.
> 
> I'm not convinced (yet?).

OK, I take that back. UFCS chaining is an important emerging D paradigm
that we should support, and the current API seems better suited for
that purpose.


> >Also, it's unclear what types of input are supported -- the code
> >example only uses string input, but what about file input? Does it
> >support byLine? Or does it need me to slurp the entire file contents
> >into an in-memory buffer first?
> 
> The input is a range of bytes, which the lexer assumes is UTF-8. The
> lexer works much faster on arrays than it does on arbitrary ranges
> though. Dmitry Olshansky implemented a circular buffer in this module
> that's used to cache the input range internally. This buffer is then
> used for the lexing process.

OK, if this can be indicated in the docs, that'd be great. :) (A
signature constraint of the form `if (is(ElementType!R == char))` should
be good enough, I think, since DDOC in DMD git HEAD will include
signature constraints in the generated docs, and that should make clear
what is expected.)
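
Something along these lines, say. The constraint is the point; and
since the input is a range of bytes, ubyte is arguably the more
accurate element type than char. (The body and LexerConfig are
stand-ins so the sketch compiles alone.)

	import std.range : ElementType, isInputRange;

	struct LexerConfig {} // stand-in

	// The constraint documents the input requirement, and DDOC in git
	// HEAD will render it as part of the signature.
	auto byToken(R)(R range, LexerConfig config)
		if (isInputRange!R && is(ElementType!R : const(ubyte)))
	{
		return range; // placeholder; the real module lexes here
	}

	void main() {
		ubyte[] src = cast(ubyte[]) "int x;".dup;
		auto r = src.byToken(LexerConfig());
	}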


[...]
> >A minor concern I have about the current implementation (I didn't
> >look at the code yet, but this is just based on the documentation),
> >is that there's no way to choose what kind of data you want in the
> >token stream.  Right now, it appears that it always computes
> >startIndex, line, and column. What if I don't care about this
> >information, and only want, say, the token type and value (say I'm
> >pretty-printing the source, so I don't care how the original was
> >formatted)?  Would it be possible to skip the additional computations
> >required to populate these fields?
> 
> It's possible, but I fear that it would make the code a mess.

Fair enough. It's not a significant overhead anyway (I believe), so it
shouldn't be a big issue.


T

-- 
Once bitten, twice cry...

