std.d.lexer : voting thread

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Wed Oct 9 00:49:55 PDT 2013


On 10/8/13 11:11 PM, ilya-stromberg wrote:
> On Tuesday, 8 October 2013 at 00:16:45 UTC, Andrei Alexandrescu wrote:
>> To put my money where my mouth is, I have a proof-of-concept tokenizer
>> for C++ in working state.
>>
>> http://dpaste.dzfl.pl/d07dd46d
>
> Why do you use "\0" as the end-of-stream token:
>
>    /**
>     * All token types include regular and reservedTokens, plus the null
>     * token ("") and the end-of-stream token ("\0").
>     */
>
> We can have a situation where "\0" is a valid token, for example with
> binary formats. Is it possible to indicate end-of-stream another way,
> maybe via an "empty" property for a range-based API?

I'm glad you asked. It's simply a decision by convention. No C++ source 
can contain a "\0", so I append one to the input and use it as a 
sentinel.
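
Concretely, it amounts to something like this (a sketch, not the exact 
code in the paste):

    // Append a '\0' sentinel so the trie matcher never needs a separate
    // end-of-input test; reaching '\0' terminates lexing.
    string prepareInput(string source)
    {
        return source ~ '\0';
    }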

A general lexer should take the EOF symbol as a parameter.
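
Something along these lines (hypothetical names, just to illustrate):

    // Parameterize the lexer on its sentinel instead of hard-coding
    // '\0'. For inputs that may legitimately contain '\0', pick another
    // symbol known not to occur, or fall back to an explicit length check.
    struct Lexer(dchar eofSymbol = '\0')
    {
        string input;   // assumed to end with eofSymbol
        size_t cursor;

        bool empty() const
        {
            return input[cursor] == eofSymbol;
        }
    }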

One more thing: the trie matcher knows a priori (statically) what the 
maximum lookahead is - it's the length of the longest symbol. That can be 
used to pre-fill the input buffer such that there's never an 
out-of-bounds access, even with input ranges.
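
For example (again a sketch, not code from the paste):

    // The maximum lookahead equals the length of the longest symbol,
    // computable at compile time via CTFE.
    size_t maxLookahead(const(string)[] symbols)
    {
        size_t m = 0;
        foreach (s; symbols)
            if (s.length > m)
                m = s.length;
        return m;
    }

    enum symbols = ["<", "<<", "<<=", ":", "::"];
    enum lookahead = maxLookahead(symbols); // 3, known statically

    // When refilling from an input range, keep at least `lookahead`
    // characters buffered past the cursor (padding with the sentinel
    // near end of input), so the trie walk never reads out of bounds.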


Andrei


