Formal review of std.lexer

Brian Schott briancschott at gmail.com
Mon Apr 14 14:34:09 PDT 2014


On Monday, 14 April 2014 at 20:43:58 UTC, Alix Pexton wrote:
> I know the official review period is long past but I'd not had 
> a good look at this module until this past weekend.
>
> Last year I had been working on my own XML lexer/parser but as 
> of yet I have nothing to show for it, so I took a look at this 
> proposal with an eye towards using it to make my efforts easier.
>
> Andrei's posts about the possible design of a generic lexer had 
> also influenced me, so I was expecting to find similarities 
> between this module and my own work, albeit with the added 
> benefits of being generic (in the good way). I have, however, 
> found it very difficult to understand much of it, which I 
> entirely put down to my own deficiencies with templates and 
> especially the use of mixins.
>
> In the example Dlang lexer, the constructor takes a ubyte[] as 
> input and wraps it in a LexerRange struct which defines the 
> normal input range primitives as well as various functions for 
> lookahead. It is not documented whether the lexer needs these 
> extra features or if they are only provided for use within the 
> tokenising functions that are supplied to the template by the 
> user. If they are used by the core of the lexer then it would 
> seem to preclude the use of any other type of input that cannot 
> be coerced into a ubyte[] without the effort on the part of the 
> user to implement the same interface.

ubyte[] is required for the lexer to work quickly. That lexer 
range is designed to keep track of the column and line numbers.
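For illustration, the idea behind that range can be sketched as a 
struct over a ubyte[] that updates its position counters on every 
advance (the names below are illustrative, not necessarily those the 
module actually uses):

```d
// Sketch of a line/column-tracking range over ubyte[]; field and
// function names are my own, not std.lexer's documented API.
struct ByteRange
{
    const(ubyte)[] bytes;
    size_t index;
    size_t line = 1;
    size_t column = 1;

    bool empty() const { return index >= bytes.length; }
    ubyte front() const { return bytes[index]; }

    void popFront()
    {
        if (bytes[index] == '\n') { line++; column = 1; }
        else column++;
        index++;
    }
}
```

Operating directly on bytes also sidesteps auto-decoding, which is a 
large part of why the ubyte[] requirement makes the lexer fast.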

> I think the description of the functionality required of the 
> tokenSeparatingFunction that the user must supply needs to be 
> much better. If I understand correctly, it is intended to 
> differentiate between keywords and identifiers which begin with 
> a keyword. The more I think about this, the less certain I am.

That function's purpose is to determine if the current code unit 
signifies the end of an identifier/keyword. When lexing 
"fortunate", the lexer would spot "for", and then call the 
separating function with the index of "u". In the case of D, it 
would say that this does NOT end a word, and "fortunate" should 
be lexed as an identifier. If it was called with an index of a 
space or parenthesis, it should return true.
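Under that reading, a separating function for D-style identifiers 
might look roughly like this (the name and signature here are guesses 
for illustration, not the template's documented interface):

```d
// Returns true if the byte at `offset` past the current position
// ends an identifier/keyword run, false if it continues one.
bool isSeparating(const(ubyte)[] input, size_t offset)
{
    if (offset >= input.length)
        return true; // end of input ends the word
    immutable ubyte c = input[offset];
    if (c == '_'
        || (c >= 'a' && c <= 'z')
        || (c >= 'A' && c <= 'Z')
        || (c >= '0' && c <= '9'))
        return false; // identifier char: "for" continues into "fortunate"
    return true; // space, '(', operators, etc. end the word
}
```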

> When the lexer dispatches to a token handler, the front of the 
> range is left pointing to the beginning of the character 
> sequence that was matched, allowing it to be included in the 
> returned token. However, many of the handlers in the example 
> Dlang lexer begin with a number of blind popFront calls to jump 
> to the end of that match. I am aware that in well-meaning code 
> this is a case of the range being logically !empty, but I also 
> wonder how often it might get overlooked when two matches of 
> different lengths are dispatched to the same handler. (I had a 
> similar situation in my own code, and my solution was to have a 
> variable storing the .save of my inputRange and count how many 
> chars were consumed since it was updated. This way I could 
> either return the whole match or part of it in the token or 
> discard it and include only what came after it.) As there has 
> been some contention about the correct use of the range 
> primitives of late, I will refrain from making any other 
> comment on their use in this module, especially as I am no 
> longer sure that I have been using them correctly myself.

If more than one prefix is dispatched to the same handler, the 
handler cannot blindly call popFront or popFrontN. If the lexer 
author knows that the handler will only be called in certain 
places, then checking !empty is a waste of cycles. I don't think 
it's possible for the lexer generator to enforce this.
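To illustrate the pitfall with plain slices (so as not to assume the 
module's LexerRange API): a handler dispatched for both "+" and "++" 
cannot skip a fixed two bytes; it has to re-check what was matched.

```d
enum TokType { plus, plusPlus }
struct Token { TokType type; }

// Hypothetical handler shared by the "+" and "++" prefixes. It must
// confirm which prefix was actually matched before consuming input.
Token lexPlus(ref const(ubyte)[] input)
{
    assert(input.length >= 1 && input[0] == '+');
    if (input.length >= 2 && input[1] == '+')
    {
        input = input[2 .. $]; // consume "++" -- length checked first
        return Token(TokType.plusPlus);
    }
    input = input[1 .. $]; // consume "+" only
    return Token(TokType.plus);
}
```

A blind `input = input[2 .. $]` would be valid only for the longer 
match; on a lone trailing "+" it would slice past the end of the 
input.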

> In the short time that I have been looking at the features of 
> this lexer I have not been able to figure out a way of writing 
> a standards compliant XML parser without having to lex some 
> parts of the document at least twice, or subverting the token 
> handlers to change behaviour according to context. Several 
> non-compliant single-pass XML lexers would be possible, but they 
> would not be able to process documents that used some 
> (admittedly obscure and often overlooked) features. The only 
> scalable technique that I can think of to allow XML to be lexed 
> in a single pass in a fully spec compliant way would be to 
> allow handlers to return multiple tokens. I am not sure how 
> feasible this would be or what mechanism would be best to 
> implement it.

XML doesn't seem to have the distinct lexing/parsing phases that 
JSON or Java code has. This lexer generator may not be 
the best way to deal with XML.

> On the whole I think the overall design of the module shows 
> promise but requires polish to make it both more idiomatically 
> Dlang-y and easier for the user to build upon (both by 
> documentation and interface).

If you have specific problems, please list them.

> On a side note related to the example lexer for Dlang, I believe 
> the predicate function isDocComment will produce false 
> positives for the following comment delimiters which to my 
> knowledge are not valid DDoc delimiters...
>
> //* //+ /*+ /*/ /+* /+/

https://github.com/Hackerpilot/Dscanner/issues/163

>
> As the Dlang lexer is not part of the review proper, I have not 
> inspected it carefully; this function just happens to be the 
> first one declared in that example.

If you do find issues with the lexer or parser, please file them 
here:
https://github.com/Hackerpilot/Dscanner/issues

> Again, my apologies for the tardiness of this review.
>
> A...



More information about the Digitalmars-d mailing list