Formal review of std.lexer

Alix Pexton alix.DOT.pexton at gmail.DOT.com
Mon Apr 14 13:43:59 PDT 2014


On 21/02/2014 12:12 PM, Dicebot wrote:
> http://wiki.dlang.org/Review/std.lexer
>
> This is follow-up by Brian to his earlier proposal
> (http://wiki.dlang.org/Review/std.d.lexer). This time proposed module
> focuses instead on generic lexer generation as discussed in matching
> voting thread.
>
> Docs: http://hackerpilot.github.io/experimental/std_lexer/phobos/lexer.html
> Code: https://github.com/Hackerpilot/Dscanner/blob/master/stdx/lexer.d
>
> Example topics to evaluate during review:
>   - Feasibility of overall design and concept
>   - Quality of documentation
>   - Consistency with existing Phobos modules
>   - Overall quality of implementation
>
> Initial review will end on March the 8th.


I know the official review period is long past, but I had not had a good 
look at this module until this past weekend.

Last year I was working on my own XML lexer/parser, but as of yet I have 
nothing to show for it, so I took a look at this proposal with an eye 
towards using it to make my efforts easier.

Andrei's posts about the possible design of a generic lexer had also 
influenced me, so I was expecting to find similarities between this 
module and my own work, albeit with the added benefits of being generic 
(in the good way). I have, however, found it very difficult to 
understand much of it, which I entirely put down to my own deficiencies 
with templates and especially the use of mixins.

In the example Dlang lexer, the constructor takes a ubyte[] as input and 
wraps it in a LexerRange struct, which defines the normal input range 
primitives as well as various functions for lookahead. It is not 
documented whether the lexer core needs these extra features or whether 
they are only provided for use within the tokenising functions that the 
user supplies to the template. If they are used by the core of the 
lexer, then any input that cannot be coerced into a ubyte[] would seem 
to be precluded, unless the user goes to the effort of implementing the 
same interface.
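
To illustrate what I mean, here is a minimal sketch of the sort of 
adapter a user would have to write for a non-array input. The member 
names, especially peek, are my own guesses for illustration, not the 
documented LexerRange API:

    import std.range.primitives;

    // Hypothetical adapter: wraps any forward range of ubyte so that it
    // exposes the same primitives a ubyte[] wrapped in LexerRange would.
    struct ForwardLexerRange(R)
        if (isForwardRange!R && is(ElementType!R : ubyte))
    {
        R range;          // the underlying input
        size_t index;     // bytes consumed so far
        size_t line = 1;  // position tracking, as LexerRange appears to do
        size_t column = 1;

        bool empty() @property { return range.empty; }
        ubyte front() @property { return range.front; }

        void popFront()
        {
            if (range.front == '\n') { ++line; column = 1; }
            else ++column;
            range.popFront();
            ++index;
        }

        // Lookahead by saving and advancing a copy; O(offset) per call,
        // which shows why random access on a ubyte[] is so much cheaper.
        ubyte peek(size_t offset = 1)
        {
            auto r = range.save;
            foreach (i; 0 .. offset)
                r.popFront();
            return r.front;
        }
    }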

I think the description of the functionality required of the 
tokenSeparatingFunction that the user must supply needs to be much 
better. If I understand correctly, it is intended to differentiate 
between keywords and identifiers which begin with a keyword, but the 
more I think about it the less certain I am.
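
My best guess, sketched below with an assumed signature (the module may 
well pass something other than a single byte), is that it is applied to 
the input just after a fixed-token match and answers whether the match 
really ends there:

    // Hypothetical separating predicate for a D-like language: return
    // true when the byte after a keyword match cannot continue an
    // identifier, so the keyword really is a whole token.
    bool isSeparating(ubyte b) pure nothrow @safe
    {
        if (b >= 'a' && b <= 'z') return false; // identifier continues
        if (b >= 'A' && b <= 'Z') return false;
        if (b >= '0' && b <= '9') return false;
        if (b == '_') return false;
        return true; // whitespace, punctuation and operators separate
    }

With such a predicate, "for" inside "foreach" would be rejected because 
the following 'e' can continue an identifier, while "for (" would be 
accepted as the keyword.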

When the lexer dispatches to a token handler, the front of the range is 
left pointing to the beginning of the character sequence that was 
matched, allowing it to be included in the returned token. However, many 
of the handlers in the example Dlang lexer begin with a number of blind 
popFront calls to jump to the end of that match. I am aware that in 
well-meaning code this is a case of the range being logically !empty, 
but I also wonder how often it might get overlooked when two matches of 
different lengths are dispatched to the same handler. (I had a similar 
situation in my own code, and my solution was to keep a variable storing 
the .save of my input range and count how many chars were consumed since 
it was updated; a sketch follows below. This way I could either return 
the whole match, or part of it, in the token, or discard it and include 
only what came after it.) As there has been some contention about the 
correct use of the range primitives of late, I will refrain from making 
any other comment on their use in this module, especially as I am no 
longer sure that I have been using them correctly myself.
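
For what it is worth, here is a sketch of that technique with 
hypothetical names; nothing here is from the proposed module:

    import std.range : drop, takeExactly;
    import std.range.primitives : isForwardRange;

    // Hypothetical wrapper: remember where the current match started so
    // a handler can take the whole match, a suffix of it, or none of it,
    // instead of replaying the match with blind popFront calls.
    struct MarkedRange(R) if (isForwardRange!R)
    {
        R range;
        private R mark_;          // .save taken at the start of the match
        private size_t consumed;  // elements popped since mark()

        void mark() { mark_ = range.save; consumed = 0; }

        bool empty() @property { return range.empty; }
        auto front() @property { return range.front; }
        void popFront() { range.popFront(); ++consumed; }

        // the whole match since the mark, as a lazy range
        auto wholeMatch() { return mark_.save.takeExactly(consumed); }

        // the match minus its first `skip` elements
        auto matchFrom(size_t skip)
        {
            return mark_.save.drop(skip).takeExactly(consumed - skip);
        }
    }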

In the short time that I have been looking at the features of this 
lexer, I have not been able to figure out a way of writing a 
standards-compliant XML parser without either lexing some parts of the 
document at least twice or subverting the token handlers to change 
behaviour according to context. Several non-compliant single-pass XML 
lexers would be possible, but they would not be able to process 
documents that use some (admittedly obscure and often overlooked) 
features. The only scalable technique I can think of that would allow 
XML to be lexed in a single pass, in a fully spec-compliant way, would 
be to allow handlers to return multiple tokens. I am not sure how 
feasible this would be or what mechanism would be best to implement it; 
one possibility is sketched below.
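
One possible shape for that mechanism, sketched with placeholder types 
(nothing like this exists in the proposed module):

    import std.container.dlist : DList;

    struct Token { uint type; string text; } // placeholder token type

    // Hypothetical: a handler pushes one or more tokens onto a queue
    // owned by the lexer, and the lexer drains the queue before
    // dispatching the next handler.
    struct QueueingLexer
    {
        DList!Token pending;

        // e.g. an XML attribute could yield three tokens at once
        void handleAttribute(string name, string value)
        {
            pending.insertBack(Token(1, name));
            pending.insertBack(Token(2, "="));
            pending.insertBack(Token(3, value));
        }

        bool empty() @property { return pending.empty; }
        Token front() @property { return pending.front; }
        void popFront() { pending.removeFront(); }
    }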

On the whole I think the overall design of the module shows promise, but 
it requires polish to make it both more idiomatically Dlang-y and easier 
for the user to build upon (in both documentation and interface).

On a side note related to the example lexer for Dlang: I believe the 
predicate function isDocComment will produce false positives for the 
following comment delimiters, which to my knowledge are not valid DDoc 
delimiters...

//* //+ /*+ /*/ /+* /+/
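
For illustration, a stricter check that accepts only the exact DDoc 
openers would avoid these; the signature is an assumption based on the 
name:

    // Hypothetical replacement: accept only ///, /** and /++ as doc
    // comment openers, so near misses like //* or /+/ are rejected.
    // (An empty /**/ block comment would still slip through and may
    // need its own test.)
    bool isDocComment(string comment) pure nothrow @safe
    {
        if (comment.length < 3)
            return false;
        return comment[0 .. 3] == "///"
            || comment[0 .. 3] == "/**"
            || comment[0 .. 3] == "/++";
    }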

As the Dlang lexer is not part of the review proper, I have not 
inspected it carefully; this function just happens to be the first one 
declared in that example.

Again, my apologies for the tardiness of this review.

A...

