Request for comments: std.d.lexer

Timon Gehr timon.gehr at gmx.ch
Mon Jan 28 13:04:39 PST 2013


On 01/28/2013 12:59 PM, Dmitry Olshansky wrote:
> 28-Jan-2013 15:48, Johannes Pfau wrote:
>> ...
>>
>> But to be fair that doesn't fit ranges very well. If you don't want to
>> do any allocation but still keep identifiers etc in memory this
>> basically means you have to keep the whole source in memory and this is
>> conceptually an array and not a range.
>>
>
> Not the whole source but to construct a table of all identifiers. The
> source is awfully redundant because of repeated identifiers, spaces,
> comments and what not. The set of unique identifiers is rather small.
>

Source code is usually small. (Even std.datetime has 'only' 1.6 MB.) My 
own lexer-parser combination slices directly into the original source 
code, for every token, in order to remember the exact source location, 
and the last time I measured, it ran faster than DMD's. I keep the 
source around for error reporting anyway:

tt.d:27:5: error: no member 'n' for type 'A'
     a.n=2;
     ^~~

Since the tokens point directly into the source code, it is not 
necessary to construct any other data structures in order to allow fast 
retrieval of the appropriate source code line.
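The token-as-slice idea can be sketched roughly as follows. This is a minimal illustration, not the actual implementation; the names `Token` and `lineOf` are hypothetical. Because each token is a slice of the one source buffer, its offset falls out of pointer arithmetic, and the line number for an error message can be recovered by counting newlines up to that offset:

```d
struct Token
{
    string text; // slices directly into the original source buffer
}

// Recover the 1-based line number of a token for error reporting.
// Valid only because tok.text is a slice of source.
size_t lineOf(string source, Token tok)
{
    size_t offset = tok.text.ptr - source.ptr;
    size_t line = 1;
    foreach (c; source[0 .. offset])
        if (c == '\n')
            line++;
    return line;
}
```

A linear scan per error is usually acceptable since errors are rare; a sorted table of line-start offsets and a binary search would make it logarithmic if needed.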

But it's clear that a general-purpose library might not want to impose 
this storage restriction on its clients. I think it is somewhat 
helpful for speed though. The other thing I do is buffering tokens in a 
contiguous ring buffer that grows if a lot of lookahead is requested.
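A growable ring buffer of that kind might look like the following sketch (illustrative names, not the code being described). Tokens are pushed at the tail and popped at the head; when lookahead exceeds the capacity, the buffer reallocates and unwraps itself:

```d
struct TokenRing(T)
{
    private T[] data;
    private size_t head, count;

    void push(T t)
    {
        if (count == data.length)
            grow();
        data[(head + count) % data.length] = t;
        count++;
    }

    // n-th token of lookahead, 0-based
    T peek(size_t n)
    {
        assert(n < count);
        return data[(head + n) % data.length];
    }

    T pop()
    {
        assert(count > 0);
        auto t = data[head];
        head = (head + 1) % data.length;
        count--;
        return t;
    }

    private void grow()
    {
        // Double the capacity and copy the live tokens to the front,
        // so the ring is contiguous again after growing.
        auto bigger = new T[data.length ? data.length * 2 : 4];
        foreach (i; 0 .. count)
            bigger[i] = data[(head + i) % data.length];
        data = bigger;
        head = 0;
    }
}
```

In the common case lookahead is tiny, so the buffer never grows and pushes/pops are a couple of index operations.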

> I think the best course of action is to just provide a hook to trigger
> on every identifier encountered. That could be as discussed earlier a
> delegate.
>
> ...

Maybe. I map identifiers to unique ids later, in the identifier AST 
node constructor, though.
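The two ideas (a per-identifier hook and interning into unique ids) could be combined as in the sketch below. Everything here is hypothetical naming for illustration: the lexer takes a delegate that fires on each identifier slice, and the client uses it to build a table of unique ids without copying strings:

```d
import std.ascii : isAlpha, isAlphaNum;

struct IdTable
{
    size_t[string] ids;

    // Return the existing id for ident, or assign the next one.
    size_t intern(string ident)
    {
        if (auto p = ident in ids)
            return *p;
        auto id = ids.length;
        ids[ident] = id;
        return id;
    }
}

// Toy identifier scanner: invokes the delegate on every
// identifier-shaped slice of the source (no allocation).
void lexIdentifiers(string source, void delegate(string) onIdentifier)
{
    size_t i;
    while (i < source.length)
    {
        if (isAlpha(source[i]) || source[i] == '_')
        {
            auto start = i;
            while (i < source.length
                    && (isAlphaNum(source[i]) || source[i] == '_'))
                i++;
            onIdentifier(source[start .. i]);
        }
        else
            i++;
    }
}
```

Since the delegate receives a slice into the source, the client decides whether to copy, intern, or ignore it, which keeps the allocation policy out of the lexer.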


More information about the Digitalmars-d mailing list