DCT: D compiler as a collection of libraries

Roman D. Boiko rb at d-coding.com
Fri May 11 05:38:23 PDT 2012


On Friday, 11 May 2012 at 11:50:29 UTC, Ary Manzana wrote:
> On 5/11/12 4:22 PM, Roman D. Boiko wrote:
>>> What about line and column information?
>> Indices of the first code unit of each line are stored inside 
>> the lexer, and a function will compute the Location (line 
>> number, column number, file specification) for any index. This 
>> way the size of a Token instance is reduced to the minimum. It 
>> is assumed that a Location can be computed on demand and is 
>> not needed frequently, so the column is calculated by a 
>> reverse walk to the previous end of line, etc. It will be 
>> possible to calculate Locations either taking into account 
>> special token sequences (e.g., #line 3 "ab/c.d") or 
>> discarding them.
>
> But then how do you efficiently compute line numbers (if a 
> reverse walk is at all efficient)?
I borrowed the trick from Brian Schott: tokens will be stored as
an array sorted by their indices in the source code (lexing emits
them in order, so they are sorted by construction and no sorting
algorithm needs to run). The same is true for the array of line
start indices.

Now, given an index, we can do a binary search in such a sorted
array and get the corresponding line number or token (whatever we
need). The column can then be calculated reasonably fast by
walking along UTF code points from the index that corresponds to
the beginning of the line, taking tab characters into account.
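A minimal sketch of this scheme (in Python rather than D for brevity; the names `line_starts` and `location`, and the tab width of 8, are my own assumptions, not DCT's API):

```python
import bisect

def line_starts(source: str) -> list[int]:
    """Indices of the first code unit of each line.

    Built in a single pass over the source, so the array is
    sorted by construction -- no sorting algorithm needed.
    """
    starts = [0]
    for i, ch in enumerate(source):
        if ch == '\n':
            starts.append(i + 1)
    return starts

def location(source: str, starts: list[int], index: int,
             tab_width: int = 8) -> tuple[int, int]:
    """Compute a 1-based (line, column) for an index on demand.

    Binary search finds the last line start <= index; a short
    forward walk from there computes the column, expanding tabs.
    """
    line = bisect.bisect_right(starts, index) - 1
    col = 1
    for ch in source[starts[line]:index]:
        if ch == '\t':
            # Advance to the next tab stop.
            col += tab_width - (col - 1) % tab_width
        else:
            col += 1
    return line + 1, col
```

The same `bisect` lookup would serve for the sorted token array: given a source index, it finds the token whose span covers that index.

```python
src = "int x;\n\tint y;\n"
starts = line_starts(src)          # [0, 7, 15]
print(location(src, starts, 12))   # index of 'y' -> (2, 13)
```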

My assumption is that column numbers are needed for use cases
like error reporting or other messages to the user, so there is
no need to pre-calculate them for each token (or character). For
such use cases this efficiency should be more than enough.

In general, I precalculate and store only information that is
needed frequently; everything else is evaluated lazily.

> Usually tokens are used and discarded. I mean, somebody who 
> uses the lexer asks for tokens, processes them (for example to 
> highlight code or to build an AST) and then discards them, so 
> you can reuse the same Token instance. If you want to peek at 
> the next token, or have a buffer of tokens, you can use a 
> freelist ( http://dlang.org/memory.html#freelists , one of the 
> many nice things I learned by looking at DMD's source code ).
But the information from tokens is not discarded (at least, that
is a requirement for DCT), so my choice is to keep it in the
tokens instead of converting it to some other form. That also
implies that Token should be free of any auxiliary information
which is not necessary for the common use cases.

> So adding line and column information is not like wasting a lot 
> of memory: just 8 bytes more for each token in the freelist.
It is inefficient to calculate it during lexing, the scanning
algorithm becomes less robust and more complicated, and in most
cases we won't need it anyway. I will add a hook to plug in such
functionality (when needed) once I know why it is necessary.


More information about the Digitalmars-d-announce mailing list