Formal review of std.lexer

Alix Pexton via Digitalmars-d digitalmars-d at puremagic.com
Tue Apr 15 02:28:30 PDT 2014


On 14/04/2014 10:34 PM, Brian Schott wrote:
> ubyte[] is required for the lexer to work quickly. That lexer range is
> designed to keep track of the column and line numbers.

I can understand that speed requires the input to be handled as bytes 
instead of chars, but the restriction to ubyte[] rather than any 
random-access range seems un-Dlang-y to me.

If the LexerRange is only there to optionally add support for 
line/column numbering, then I think it needs a much more descriptive 
name and much better documentation.

> That function's purpose is to determine if the current code unit
> signifies the end of an identifier/keyword. When lexing "fortunate", the
> lexer would spot "for", and then call the separating function with the
> index of "u". In the case of D, it would say that this does NOT end a
> word, and "fortunate" should be lexed as an identifier. If it was called
> with an index of a space or parenthesis, it should return true.

Somehow I had skipped the salient part of the documentation for this, 
so my apologies. (At first I thought that the DDoc output I was 
reading must have been out of sync with the source; it was, but not by 
that much. Further investigation suggests some form of temporary 
blindness.)

This description squares with what I had surmised from reading the 
code, and I can see why it would be more efficient than the common 
technique of comparing every matched identifier against the list of 
keywords (sketched below for contrast). I do wonder, however, whether 
there might be another way to attain the same efficiency without the 
need for the separating function. (I should have replied to this last 
night when my ideas on the matter were clearer; sleep seems to have 
stolen my theoretical alternative ><.)
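
For reference, the common technique I mean looks roughly like this (a 
minimal sketch; the names are mine):

import std.algorithm.searching : canFind;

immutable string[] keywords = ["for", "foreach", "if", "while"];

// Lex the longest identifier first, then test the whole lexeme
// against the keyword list; every identifier pays for this lookup,
// which the separating-function approach avoids.
bool isKeyword(string lexeme)
{
    return keywords.canFind(lexeme);
}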

I'm also curious about the decision that the signature of the 
separating function should take the offset of the character that needs 
to be tested. A more Dlang-y thing, to my mind, would be to pass a 
range that begins at the first character after the matched keyword.
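
To illustrate the contrast (hypothetical sketches only; I've made the 
input an explicit parameter in the first version to keep it 
self-contained, and the bodies are illustrative, not the real ones):

import std.range.primitives : isInputRange, empty, front;
import std.ascii : isAlphaNum;

// As I understand the current design: the separating function is
// handed the offset of the code unit to test.
bool isSeparating(const(ubyte)[] input, size_t offset)
{
    return offset >= input.length || !isAlphaNum(input[offset]);
}

// What I would find more Dlang-y: hand the function a range that
// begins at the first code unit after the matched keyword.
bool isSeparating(R)(R afterKeyword) if (isInputRange!R)
{
    return afterKeyword.empty || !isAlphaNum(afterKeyword.front);
}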

> If more than one prefix is dispatched to the same handler, the handler
> cannot blindly call popFront or popFrontN. If the lexer author knows
> that the handler will only be called in certain places, then checking
> !empty is a waste of cycles. I don't think it's possible for the lexer
> generator to enforce this.

I don't think the lexer needs to do anything extra to ensure that a 
handler can begin its work without either repeating calls to .empty or 
blindly calling popFront. Doing so requires that the input be treated 
as a forward range: before reading a token, store the .save of the 
input, then advance the input as the token is matched, counting the 
consumed elements. When the handler is called, it has the option of 
including the token, by adding to the count and returning a slice that 
begins at the .save'd position, or of ignoring the length of the match 
and returning a slice that begins at the position after the matched 
token.
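
A minimal sketch of the scheme, with my own names throughout:

import std.range.primitives;
import std.ascii : isAlpha;

// Remember the .save'd start and count consumed elements. A handler
// can then include the match by taking consumed + n elements from
// `start`, or skip it by slicing from the current position of
// `input`, without re-checking .empty or calling popFront blindly.
size_t matchWord(R)(ref R input, out R start)
    if (isForwardRange!R)
{
    start = input.save;      // position before the token
    size_t consumed = 0;
    while (!input.empty && isAlpha(input.front))
    {
        input.popFront();
        ++consumed;          // count what the match used
    }
    return consumed;
}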

I strongly believe that a model which requires the user to reason 
about a library providing ranges that are "logically !empty" is a 
misstep.

> XML doesn't seem to have very distinct lexing/parsing phases like JSON
> markup or Java code does. This lexer generator may not be the best way
> to deal with XML.

I don't really know how the grammar in the XML standard compares to 
others in general. It is certainly more convoluted than the published 
grammar for Dlang, though we all know that doesn't quite match the 
language. Some criticise XML for being too complicated, but in some 
respects it is even more complicated than that; it seems ironic that 
it was supposed to be simple.

But I might have typed too soon when I wrote that this lexer might not 
be able to fully tokenize XML in a single pass. As I was drifting off to 
sleep last night I had the idea of using the extraFields of 
TokenStructure to add a token pointer (or equivalent construct) to the 
tokens, making a singly linked list and allowing me to return multiple 
tokens from a token handler. I just need to convince myself that it is a 
sane thing to do.
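
Roughly what I have in mind (untested, and I'm recalling 
TokenStructure's shape from memory, so treat the details as 
approximate):

alias IdType = ubyte; // placeholder id type for the sketch

// Use extraFields to give each token an optional link to a
// follow-on token, turning one handler call's output into a
// singly linked list of tokens.
alias Token = TokenStructure!(IdType, q{
    typeof(this)* next; // next token from the same handler, or null
});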

> If you have specific problems, please list them.

I think the calculator example should be improved or replaced so that 
the documentation's first example uses all of the string[]s that can 
be passed to the lexer. Perhaps a simple symbolic calculator that 
accepts letter sequences as variables and reserves "pi" as a keyword?
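
Something along these lines, perhaps (parameter names as I remember 
them from the docs, so treat them as approximate):

// A symbolic calculator: fixed operator tokens, dynamic numbers and
// identifiers, and one keyword ("pi") that can prefix a longer
// identifier such as "pine" -- exercising the separating function.
enum string[] staticTokens = ["+", "-", "*", "/", "(", ")", "="];
enum string[] dynamicTokens = ["number", "identifier"];
enum string[] possibleDefaultTokens = ["pi"];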

I think that the section describing the token handlers needs to fully 
document the primitives the user has access to for lexing a token, the 
state those primitives will be in when the handler is called, and the 
state they should be left in when it returns.

If tracking lines and columns is optional and supported only when 
wrapping the input in LexerRange, then I think that by default Tokens 
should not contain fields for them. Perhaps an enum q{} string 
declaring the required fields could be introduced, with an example 
showing it concatenated with the user's own declarations and passed in 
via extraFields.
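
For example (the names here are mine, not a proposal for the actual 
identifiers):

alias IdType = ubyte; // placeholder id type for the sketch

// Opt-in line/column fields, declared once...
enum lineColumnFields = q{
    size_t line;
    size_t column;
};

// ...and concatenated with the user's own extraFields when wanted:
alias Token = TokenStructure!(IdType, lineColumnFields ~ q{
    string comment; // a user-declared extra field
});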

I'll again restate that I think LexerRange needs a name that better 
reflects its functionality. From reading the code I see that it does a 
little more than simply track line/column: it also provides more 
efficient implementations of some of the range primitives. I think the 
user should be able to take advantage of those savings without also 
committing to line/column tracking. There should also be the option of 
tracking only lines, as column information is not always as accurate 
or useful.

I think that covers all of my concerns for now; I'll leave my crazy 
ideas for "improvements" for another day.

> If you do find issues with the lexer or parser, please file them here:
> https://github.com/Hackerpilot/Dscanner/issues

Will do!

A...


