std.d.lexer : voting thread

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Fri Oct 11 11:32:09 PDT 2013


On 10/11/13 2:17 AM, Dmitry Olshansky wrote:
> 06-Oct-2013 20:07, Andrei Alexandrescu writes:
>> On 10/6/13 5:40 AM, Joseph Rushton Wakeling wrote:
>>> How quickly do you think this vision could be realized? If soon, I'd say
>>> it's worth delaying a decision on the current proposed lexer, if not ...
>>> well, jam tomorrow, perfect is the enemy of good, and all that ...
>>
>> I'm working on related code, and got all the way there in one day
>> (Friday) with a C++ tokenizer for linting purposes (doesn't open
>> #includes or expand #defines etc; it wasn't meant to).
>>
>> The core generated fragment that does the matching is at
>> https://dpaste.de/GZY3.
>>
>> The surrounding switch statement (also in library code) handles
>> whitespace and line counting. The client code needs to handle by hand
>> things like parsing numbers (note how the matcher stops upon the first
>> digit), identifiers, comments (matcher stops upon detecting "//" or
>> "/*") etc. Such things can be achieved with hand-written code (as I do),
>> other similar tokenizers, DFAs, etc. The point is that the core loop
>> that looks at every character looking for a lexeme is fast.
>
> This is something I agree with.
> I'd call that loop the "dispatcher loop" in a sense that it detects the
> kind of stuff and forwards to a special hot loop for that case (if any,
> e.g. skipping comments).
>
> BTW it absolutely must be able to do so in one step, the generated code
> already knows that the token is tok!"//" hence it may call proper
> handler right there.
>
> case '/':
> ... switch(s[1]){
> ...
>      case '/':
>          // it's a pseudo token anyway so instead of
>          //t = tok!"//";
>
>          // just _handle_ it!
>          t = hookFor!"//"(); //user hook for pseudo-token
>          // eats whitespace & returns tok!"comment" or some such
>          // if need be
>          break token_scan;
> }
>
> This also helps to get not only "raw" tokens but also lets the user
> cook extra tokens by hand for special cases that can't be handled by
> the "dispatcher loop".
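In C++ terms (Andrei's tokenizer above is a C++ one), forwarding straight to a user hook instead of materializing a pseudo-token might look like the sketch below. All names here (Tok, hookForLineComment, nextToken) are invented for illustration and are not part of any proposed API:

```cpp
#include <cstddef>
#include <string>

// Hypothetical token kinds; illustrative only.
enum class Tok { slash, comment, eof };

struct Token { Tok kind; std::string text; };

// User hook for the "//" pseudo-token: consumes to end of line and
// returns a cooked "comment" token, as Dmitry suggests.
Token hookForLineComment(const std::string& s, std::size_t& i) {
    std::size_t start = i;
    while (i < s.size() && s[i] != '\n') ++i;
    return {Tok::comment, s.substr(start, i - start)};
}

// Dispatcher loop: looks at the first character(s) and either emits a
// raw token or forwards to the matching hook in one step, with no
// intermediate tok!"//" in between.
Token nextToken(const std::string& s, std::size_t& i) {
    if (i >= s.size()) return {Tok::eof, ""};
    switch (s[i]) {
    case '/':
        if (i + 1 < s.size() && s[i + 1] == '/')
            return hookForLineComment(s, i);  // handle it right here
        ++i;
        return {Tok::slash, "/"};
    default:
        ++i;  // real code would dispatch identifiers, numbers, etc.
        return {Tok::eof, ""};
    }
}
```

The point of the single-step dispatch is that the generated switch already knows which pseudo-token it is looking at, so calling the handler directly costs nothing extra.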

That's a good idea. The only concerns I have are:

* I'm biased toward patterns for laying out efficient code, having 
hacked on such for the past year. Even discounting for that, I have the 
feeling that speed is near the top of the list of criteria for people 
who evaluate lexer generators. I fear that too much inline code inside a 
fairly large switch statement may hurt efficiency, which is why I'm 
biased in favor of "small core loop dispatching upon the first few 
characters, out-of-line code for handling particular cases that need 
attention".
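A rough C++ sketch of that shape, with a minimal dispatch loop and out-of-line handlers; the token representation and handler names are invented for illustration, not taken from the actual generator:

```cpp
#include <cctype>
#include <string>
#include <vector>

// Illustrative token type; not the proposed std.d.lexer API.
struct Token { std::string kind, text; };

// Out-of-line handlers: the core loop stops at the first digit or
// identifier character and hands off to hand-written code.
static std::size_t lexNumber(const std::string& s, std::size_t i,
                             std::vector<Token>& out) {
    std::size_t start = i;
    while (i < s.size() && std::isdigit((unsigned char)s[i])) ++i;
    out.push_back({"number", s.substr(start, i - start)});
    return i;
}

static std::size_t lexIdentifier(const std::string& s, std::size_t i,
                                 std::vector<Token>& out) {
    std::size_t start = i;
    while (i < s.size() &&
           (std::isalnum((unsigned char)s[i]) || s[i] == '_')) ++i;
    out.push_back({"identifier", s.substr(start, i - start)});
    return i;
}

// Small core loop: one dispatch over the current character, nothing
// inlined beyond the dispatch itself.
std::vector<Token> tokenize(const std::string& s) {
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < s.size()) {
        char c = s[i];
        if (c == ' ' || c == '\n') { ++i; continue; }  // whitespace
        if (std::isdigit((unsigned char)c)) { i = lexNumber(s, i, out); continue; }
        if (std::isalpha((unsigned char)c) || c == '_') { i = lexIdentifier(s, i, out); continue; }
        out.push_back({"op", std::string(1, c)});  // single-char operator
        ++i;
    }
    return out;
}
```

Keeping the handlers out of line keeps the hot dispatch loop small enough to stay in the instruction cache, which is the efficiency argument above.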

* I've grown to be a big fan of the simplicity of the generator. Yes, 
that also means it's spare on features, but it's simple enough to be 
used casually for the simplest tasks that people wouldn't normally think 
of using a lexer for. If we add hookFor, it would be great if it didn't 
impact simplicity much.
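One way a hookFor-style extension could stay out of the casual user's way is as an optional callback with a sensible default. This is purely a sketch under that assumption; none of these names exist in the proposed lexer:

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical hook signature: given the input and a cursor, consume
// the pseudo-token and return a cooked token kind.
using Hook = std::function<std::string(const std::string&, std::size_t&)>;

// Default behavior: skip a "//" comment to end of line.
std::string defaultCommentHook(const std::string& s, std::size_t& i) {
    while (i < s.size() && s[i] != '\n') ++i;
    return "comment";
}

// Users who never pass a hook get the default and see no extra API
// surface; users who need special handling supply their own.
std::string lexOne(const std::string& s, std::size_t& i,
                   const Hook& onComment = defaultCommentHook) {
    if (i + 1 < s.size() && s[i] == '/' && s[i + 1] == '/')
        return onComment(s, i);
    ++i;
    return "char";
}
```

With a defaulted parameter the feature costs nothing in interface complexity unless it is actually used.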


Andrei



More information about the Digitalmars-d mailing list