Request for comments: std.d.lexer

Dmitry Olshansky dmitry.olsh at gmail.com
Wed Jan 30 08:44:06 PST 2013


29-Jan-2013 01:04, Timon Gehr wrote:
> On 01/28/2013 12:59 PM, Dmitry Olshansky wrote:
>> 28-Jan-2013 15:48, Johannes Pfau wrote:
>>> ...
>>>
>>> But to be fair that doesn't fit ranges very well. If you don't want to
>>> do any allocation but still keep identifiers etc in memory this
>>> basically means you have to keep the whole source in memory and this is
>>> conceptually an array and not a range.
>>>
>>
>> Not the whole source; it's enough to construct a table of all the
>> identifiers. The source is awfully redundant because of repeated
>> identifiers, spaces, comments and what not. The set of unique
>> identifiers is rather small.
>>
>
> Source code is usually small. (Even std.datetime has 'only' 1.6 MB.) My
> own lexer-parser combination slices directly into the original source
> code, for every token, in order to remember the exact source location,
> and the last time I measured, it ran faster than DMD's. I keep the
> source around for error reporting anyway:
>
> tt.d:27:5: error: no member 'n' for type 'A'
>      a.n=2;
>      ^~~
>
> Since the tokens point directly into the source code, it is not
> necessary to construct any other data structures in order to allow fast
> retrieval of the appropriate source code line.
>
> But it's clear that a general-purpose library might not want to impose
> this restriction on storage upon its clients. I think it is somewhat
> helpful for speed though. The other thing I do is buffering tokens in a
> contiguous ring buffer that grows if a lot of lookahead is requested.
>
>> I think the best course of action is to just provide a hook to trigger
>> on every identifier encountered. That could be, as discussed earlier, a
>> delegate.
>>
>> ...
>
> Maybe. I map identifiers to unique ids later, in the identifier AST
> node constructor, though.


In the allocation scheme I proposed, the ID could be a 32-bit offset into 
the chunk of unique identifiers. The identifiers themselves would then be 
stored with a sentinel (e.g. a zero byte) at the end, as in C. The 'map' 
step then comes for free.

The only overhead is copying the bytes into this buffer, but that happens 
only once per unique identifier, and in exchange you get to lex anything, 
not only the 'whole module in one memory block' kind of input.
On the plus side it's also more cache-friendly.

Overall I think it's a good trade-off, but I never had the time to 
exercise it.

-- 
Dmitry Olshansky

