std.d.lexer requirements

Walter Bright newshound2 at digitalmars.com
Wed Aug 1 21:30:44 PDT 2012


On 8/1/2012 8:04 PM, Jonathan M Davis wrote:
> On Wednesday, August 01, 2012 17:10:07 Walter Bright wrote:
>> 1. It should accept as input an input range of UTF8. I feel it is a mistake
>> to templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16
>> should use an 'adapter' range to convert the input to UTF8. (This is what
>> component programming is all about.)
>
> But that's not how ranges of characters work. They're ranges of dchar. Ranges
> don't operate on UTF-8 or UTF-16. They operate on UTF-32. You'd have to create
> special wrappers around string or wstring to have ranges of UTF-8. The way
> that it's normally done is to have ranges of dchar and then special-case
> range-based functions for strings. Then the function can operate on any range
> of dchar but still operates on strings efficiently.

I have argued against making ranges of characters ranges of dchar, for 
performance reasons. This will hit the performance of the lexer especially hard.

The fact is, files are normally in UTF-8, and just about everything else is in 
UTF-8 too. Prematurely converting to UTF-32 is a performance disaster. Note that 
the single largest thing holding back regex performance is the premature 
conversion to dchar and back to char.

If the lexer is required to accept dchar ranges, its performance will drop by at 
least half, and people are going to go reinvent their own lexers.
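
As a rough sketch of the adapter approach (illustrative names only, not an 
actual Phobos or std.d.lexer API): a little range that transcodes UTF-16 input 
to UTF-8 code units on the fly, so the lexer itself only ever deals in char.

import std.range : isInputRange;
import std.utf : encode;

// Illustrative adapter: presents a range of UTF-16 code units as a range
// of UTF-8 code units, so the lexer only has to handle char.
struct Utf16ToUtf8(R) if (isInputRange!R)
{
    private R src;           // source of UTF-16 code units
    private char[4] buf;     // UTF-8 encoding of the current code point
    private size_t len, pos;

    this(R src) { this.src = src; refill(); }

    private void refill()
    {
        pos = len = 0;
        if (src.empty) return;
        dchar c = src.front; src.popFront();
        if (c >= 0xD800 && c <= 0xDBFF && !src.empty)   // surrogate pair
        {
            dchar lo = src.front; src.popFront();
            c = cast(dchar)(0x10000 + ((c - 0xD800) << 10) + (lo - 0xDC00));
        }
        len = encode(buf, c);   // 1 to 4 UTF-8 code units
    }

    @property bool empty() const { return len == 0; }
    @property char front() const { return buf[pos]; }
    void popFront() { if (++pos == len) refill(); }
}

A UTF-16 source r is then fed to the lexer as Utf16ToUtf8!(typeof(r))(r), and 
the same trick covers UTF-32.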


> So, it's quite possible to make a lexer operate on ranges of dchar but be
> operating primarily on ASCII by special-casing strings and avoiding decoding
> as much as possible. My lexer does a good job of this, I believe, so it's
> thoroughly optimized for strings, but it's still operating on ranges of dchar,
> not ranges of UTF-8.
>
>> 2. It should output an input range of tokens
>>
>> 3. tokens should be values, not classes
>>
>> 4. It should avoid memory allocation as much as possible
>>
>> 5. It should not read or write any mutable global state outside of its "Lexer"
>> instance
>>
>> 6. A single "Lexer" instance should be able to serially accept input ranges,
>> sharing and updating one identifier table
>
> When you say identifier table, do you mean symbol table, or something else?

All identifiers are entered into a hashtable, and are referred to by pointers 
into that hashtable for the rest of dmd. This makes symbol lookups incredibly 
fast, as no string comparisons are done.
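
Roughly like this (hypothetical code, not dmd's actual implementation): the 
lexer interns each identifier once and hands out a stable pointer, and 
everything downstream compares those pointers instead of strings.

struct Identifier
{
    string name;   // the single canonical copy of the spelling
}

struct IdentifierTable
{
    private Identifier*[string] table;

    // The same spelling always yields the same pointer, so downstream
    // code compares identifiers by pointer instead of by string.
    Identifier* intern(string s)
    {
        if (auto p = s in table)
            return *p;
        auto id = new Identifier(s);
        table[s] = id;
        return id;
    }
}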


> Isn't the symbol table something for the parser to worry about? Other than
> parsers, I can't think of anything which would even _care_ about a symbol
> table. And depending on how a symbol table were done, you'd want it to take
> scoping rules into account (_I'd_ certainly expect it to), meaning that it
> only includes identifiers which are relevant to the current scope. And if
> that's the case, it _really_ doesn't belong in the lexer. The code using the
> lexer can easily have its own table that it populates according to however it
> wants it populated by processing the identifier tokens as they get lexed. Not
> to mention, don't symbol tables usually include type information that a lexer
> _wouldn't_ have?

The symbol table becomes a symbol table of pointers into the lexer's identifier 
table.


> Sure, a symbol table _could_ be put in the lexer (at least if it doesn't care
> about scoping), but I definitely question that that's the right place for it.

The symbol table isn't in the lexer; the identifier string table is.


> If you mean a table that simply lists identifiers rather than a symbol table,
> I'm not quite sure why you'd want it in addition to the symbol table, and
> again, I'd expect only parsers to care about that sort of thing, so I'd expect
> the parser to be implementing it.

Because it is gawdammed fast to do it that way. An identifier is treated as a 
string exactly once, in the lexer, and thereafter by its unique handle.


> So, what exactly are you looking for the lexer to implement as far as an
> identifier table goes, and why is it in the lexer rather than the parser?

Very simple. A mapping of [identifier string] => [unique value], and a pointer 
serves nicely as that unique value.


>> 7. It should accept a callback delegate for errors. That delegate should
>> decide whether to:
>>      1. ignore the error (and "Lexer" will try to recover and continue)
>>      2. print an error message (and "Lexer" will try to recover and continue)
>>      3. throw an exception, "Lexer" is done with that input range
>
> I'm currently treating errors as tokens. It then becomes easy for the code
> using the lexer to just ignore the errors, to process them immediately, or to
> put off dealing with them until the lexing is complete. So, the code using the
> lexer can handle errors however and whenever it likes without having to worry
> about delegates or exceptions. Since tokens are lexed lazily, the fact that an
> error is reported as a token doesn't require that the lexing continue, but it
> also makes it _easy_ to continue lexing, ignoring or saving the error. I'm
> inclined to think that that's a superior approach to using delegates and
> exceptions.

I don't agree. It means that the parser has to check everywhere for error 
tokens. Besides the ugliness, it means a slowdown for those checks.
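
For concreteness, the delegate scheme of point 7 could be shaped something 
like this (hypothetical signatures, not a proposed API):

// Hypothetical: the caller supplies the error policy as a delegate.
alias ErrorHandler = void delegate(string message, size_t line, size_t column);

struct Lexer
{
    ErrorHandler onError;

    private void error(string msg, size_t line, size_t col)
    {
        // The delegate ignores, prints, or throws.  If it throws, lexing of
        // this input range ends; otherwise the lexer tries to recover.
        onError(msg, line, col);
    }
}

unittest
{
    // e.g. a caller that just counts errors and keeps lexing
    size_t errorCount;
    auto lexer = Lexer((string msg, size_t line, size_t col) { ++errorCount; });
}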


>> 8. Lexer should be configurable as to whether it should collect information
>> about comments and ddoc comments or not
>>
>> 9. Comments and ddoc comments should be attached to the next following
>> token, they should not themselves be tokens
>
> Why?

Because then the parser would have to add checks for 0 or more comment tokens 
between every other token. It uglifies the parser pretty badly, and slows it down.
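
Concretely, "attached" just means the token carries the text along 
(illustrative field names):

// Illustrative: the comment rides along on the token that follows it,
// so the grammar never has to mention comment tokens at all.
enum TokenType { identifier, keyword, literal, /+ ... +/ }

struct Token
{
    TokenType type;
    string    text;      // the token's own text
    string    comment;   // ddoc/comment text preceding this token, or null
    uint      line;
}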

> I'd argue for just processing them as tokens just like I'm doing with
> errors. The code using the lexer can then process them or ignore them as it
> likes (it can even specifically look for them and ignore all other tokens if
> it's something which only operates on comments). And if you want to see which
> token follows the comment, it's the next one in the range. So, I don't see
> much need to attach comments to other tokens. What's the advantage of not
> treating them as tokens?

Performance and simplicity. Your grammar will be awfully ugly unless you insert 
another range to filter out those tokens - but that'll be a slowdown.
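
The filtering alternative referred to here would look roughly like this 
(hypothetical token types), and it costs a predicate test on every token:

import std.algorithm : filter;

enum TokType { comment, identifier, keyword /+ ... +/ }
struct Tok { TokType type; string text; }

// Strip comment tokens before the parser sees the stream: an extra pass
// and an extra branch per token.
auto withoutComments(R)(R tokens)
{
    return tokens.filter!(t => t.type != TokType.comment);
}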


>> 12. It should come with unittests that, using -cov, show 100% coverage
>
> 100% is actually unreasonable, because there are lines which should _never_ be
> hit (e.g. an assert(0) line in a switch statement). All lines _other_ than
> those sort of lines should have code coverage, but depending on the lexer
> implementation, 100% is impossible. std.datetime has a ton of unit tests and
> IIRC it still only manages 98% because of assert(0) lines. So, I agree with
> the spirit of your statement, but it's not necessarily reasonable to do
> exactly what you're saying.

I know, but I expect all unreachable code to be justified. The lexer is a 
well-defined problem, and this should not be difficult.
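
The lines in question are deliberately unreachable ones, along these lines 
(hypothetical function):

void lexEscape(char c)
{
    switch (c)
    {
        case 'n': /+ ... +/ break;
        case 't': /+ ... +/ break;
        default:
            assert(0);   // unreachable by construction, so -cov counts it as a miss
    }
}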



