std.d.lexer: pre-voting review / discussion

H. S. Teoh hsteoh at quickfur.ath.cx
Wed Sep 11 15:02:46 PDT 2013


On Wed, Sep 11, 2013 at 01:31:45PM -0700, Walter Bright wrote:
> On 9/11/2013 12:55 PM, H. S. Teoh wrote:
> >>3. I assumed TokenType is a type. But it's not, it's an enum. Even
> >>the document says it's a 'type', but it's not a type.
> >
> >I don't understand this complaint. I thought it's pretty clear that
> >it's referring to what type of token it is (or would you prefer
> >TokenKind?).
> 
> A type is a well-understood concept in D. This is using 'type' to
> mean something completely different. This is very jarring. Your idea
> of using 'kind' is vastly superior.

No, in typical D code, user-defined type names are capitalized, such as
InputRange, Duration, Complex, ByLine, etc. They do not have 'Type' in
their names (we don't call them InputRangeType, for example). So this
usage doesn't conflict at all.


> >>5. The LexerConfig initialization should be a constructor rather
> >>than a sequence of assignments. LexerConfig documentation is awfully
> >>thin.  For example, 'tokenStyle' is explained as being 'Token
> >>style', whatever that is.
> >
> >I'm on the fence about this one. Setting up the configuration before
> >starting the lexing process makes sense to me.
> 
> Yes, using a constructor :-)

Constructors with many arguments are evil, because you can't just change
one setting and take the defaults for the rest; you have to spell out
every argument up to and including the last one you need to change.
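
A minimal sketch of the problem (the struct and field names here are
made up for illustration, not the actual LexerConfig API):

	// Hypothetical config type: with a constructor taking defaulted
	// arguments, changing only the last setting forces you to restate
	// every default before it; with plain field assignment you touch
	// only the one setting you care about.
	struct Config
	{
	    bool skipWhitespace = true;
	    bool skipComments   = true;
	    bool keepStrings    = false;
	    int  tabWidth       = 8;

	    this(bool skipWhitespace, bool skipComments = true,
	         bool keepStrings = false, int tabWidth = 8)
	    {
	        this.skipWhitespace = skipWhitespace;
	        this.skipComments   = skipComments;
	        this.keepStrings    = keepStrings;
	        this.tabWidth       = tabWidth;
	    }
	}

	void main()
	{
	    // Constructor style: to change only tabWidth, every earlier
	    // default must be spelled out again.
	    auto a = Config(true, true, false, 4);

	    // Assignment style: defaults stay put, only one field changes.
	    Config b;
	    b.tabWidth = 4;
	}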


> >>6. No clue how lookahead works with this. Parsing D requires
> >>arbitrary lookahead.
> >
> >What has this got to do with lexing?
> 
> The lexer needs to support backing up and going forward again.
> That's how the dmd lexer works.

Then the parser should be fixed. Keeping track of what tokens have been
seen so far is the parser's job, not the lexer's. LALR(1) parsers, for
example, deal with multiple similar-looking constructs without requiring
more than a single token lookahead. The trick is that you do not assume
a particular grammar production until it has been fully disambiguated.
The grammar, if not conducive to this, can usually be rearranged to make
it possible.  For example, if the grammar has two production rules:

	StatementA ::= TokenA TokenB TokenC TokenD
	StatementB ::= TokenA TokenB TokenC TokenE

then you may claim that you need a 4-token lookahead in order to tell,
given TokenA in the current input, which type of statement it's going to
be. However, these rules can be left-factored like this:

	StatementABPrefix ::= TokenA TokenB TokenC
	StatementA ::= StatementABPrefix TokenD
	StatementB ::= StatementABPrefix TokenE

Now the parser only needs a 1-token lookahead to do its job, and no
backtracking is ever needed. Any information encoded by the first 3
tokens can be stored in the AST node of StatementABPrefix, so it's
recoverable without needing to save previous tokens.
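
To make this concrete, here is a toy sketch of a parse of the factored
grammar. It uses the placeholder tokens from the example above (not D's
actual token types) and needs only a single token of lookahead, with no
backtracking and no token buffering:

	import std.stdio : writeln;

	enum Tok { A, B, C, D, E }

	struct Parser
	{
	    Tok[] input;

	    Tok peek() { return input[0]; }   // the single token of lookahead
	    Tok take() { auto t = input[0]; input = input[1 .. $]; return t; }

	    void expect(Tok t)
	    {
	        if (take() != t) throw new Exception("unexpected token");
	    }

	    // StatementABPrefix ::= TokenA TokenB TokenC
	    void parsePrefix() { expect(Tok.A); expect(Tok.B); expect(Tok.C); }

	    // StatementA ::= StatementABPrefix TokenD
	    // StatementB ::= StatementABPrefix TokenE
	    string parseStatement()
	    {
	        parsePrefix();
	        switch (peek())   // one token decides the production
	        {
	            case Tok.D: take(); return "StatementA";
	            case Tok.E: take(); return "StatementB";
	            default:    throw new Exception("expected TokenD or TokenE");
	        }
	    }
	}

	void main()
	{
	    writeln(Parser([Tok.A, Tok.B, Tok.C, Tok.D]).parseStatement()); // StatementA
	    writeln(Parser([Tok.A, Tok.B, Tok.C, Tok.E]).parseStatement()); // StatementB
	}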

Unless you're saying that the D grammar is more complex than LALR(1)...
If so, I'd like to see an example of where this is the case.


> >Also, it's unclear what types of input is supported -- the code
> >example only uses string input, but what about file input? Does it
> >support byLine? Or does it need me to slurp the entire file contents
> >into an in-memory buffer first?
> 
> The point of having the input be an InputRange is you DO NOT CARE if
> the input is coming from a string, file, etc.

Yes, but the docs currently don't say what form this input range must
take. Must it be a range of characters? Am I allowed to pass in a range
of *strings* (representing source lines)? This needs to be clearly
stated.
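
Something like the following is what I'd expect the documentation to
spell out. This is only an illustration of the kind of constraint that
could be stated (the function name and the exact constraint are mine,
not std.d.lexer's):

	import std.range.primitives : isInputRange, ElementEncodingType;
	import std.traits : isSomeChar;

	// Hypothetical: accept an input range of characters, reject a
	// range of strings/lines.
	void lexSource(R)(R source)
	    if (isInputRange!R && isSomeChar!(ElementEncodingType!R))
	{
	    // ... tokenizing elided ...
	}

	void main()
	{
	    lexSource("int x = 1;");          // a string: accepted
	    // lexSource(["int x", "= 1;"]);  // a range of strings: rejected
	}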

The fact that tokens slice the input also needs to be clearly stated,
since that precludes the use of transient ranges like byLine (your token
values will be garbled by the time you get to them).
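
A small sketch of why (plain Phobos, not std.d.lexer; the file name is
made up): File.byLine reuses its internal buffer, so any slice of an
earlier line is overwritten by later reads.

	import std.stdio : File, writeln;
	import std.file : write, remove;

	void main()
	{
	    write("demo.txt", "first line\nsecond line\n");
	    scope(exit) remove("demo.txt");

	    const(char)[][] kept;
	    foreach (line; File("demo.txt").byLine())
	    {
	        kept ~= line;         // keeps a slice of byLine's reused buffer
	        // kept ~= line.idup; // copying is what actually preserves the text
	    }

	    // Likely prints overwritten text rather than the two distinct
	    // lines, because every kept slice aliases the same buffer.
	    writeln(kept);
	}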


T

-- 
Let's call it an accidental feature. -- Larry Wall

