DCT: D compiler as a collection of libraries

Roman D. Boiko rb at d-coding.com
Mon May 14 08:00:36 PDT 2012


On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:
> I think you are wasting much more memory and performance by 
> storing all the tokens in the lexer.
>
> Imagine I want to implement a simple syntax highlighter: just 
> highlight keywords. How can I tell DCT to *not* store all 
> tokens because I need each one in turn? And since I'll be 
> highlighting in the editor I will need column and line 
> information. That means I'll have to do that O(log(n)) 
> operation for every token.
>
> So you see, for the simplest use case of a lexer the 
> performance of DCT is awful.
>
> Now imagine I want to build an AST. Again, I consume the tokens 
> one by one, probably peeking in some cases. If I want to store 
> line and column information I just copy them to the AST. You 
> say the tokens are discarded but their data is not, and that's 
> why their data is usually copied.

I am currently considering making Token a class instead of a struct.

A token (from 
https://github.com/roman-d-boiko/dct/blob/master/fe/core.d) is:

// Represents a lexed token
struct Token
{
     size_t startIndex; // position of the first code unit in the source string
     string spelling;   // characters from which this token has been lexed
     TokenKind kind;    // enum; each keyword and operator has a dedicated kind
     ubyte annotations; // meta information, e.g. whether a token is valid,
                        // or whether an integer literal is signed, long,
                        // hexadecimal, etc.
}
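
For comparison, the class version might look roughly like this. This 
is only a sketch (the constructor is mine, not existing DCT code), 
assuming TokenKind stays as defined in fe/core.d:

// Sketch only: Token as a class; fields mirror the struct above
class Token
{
     size_t startIndex; // position of the first code unit in the source string
     string spelling;   // characters from which this token has been lexed
     TokenKind kind;    // enum; each keyword and operator has a dedicated kind
     ubyte annotations; // meta information (validity, literal flags, etc.)

     this(size_t startIndex, string spelling, TokenKind kind,
          ubyte annotations)
     {
         this.startIndex = startIndex;
         this.spelling = spelling;
         this.kind = kind;
         this.annotations = annotations;
     }
}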

Making it a class would give several benefits:

* no need to allocate one big contiguous array of tokens. E.g., on a 
64-bit OS the largest module in Phobos (IIRC, std.datetime) consumes 
13.5 MB as an array of almost 500K tokens. An array of class 
references would need a four times smaller chunk of contiguous 
memory, because each element would occupy only 8 bytes instead of 32 
(the token instances themselves would then be separate heap 
allocations).

* allow subclassing, for example to store strongly typed literal 
values (see the sketch after this list); this flexibility could also 
facilitate future extensions, although it is difficult to predict 
which kinds of extension may be needed

* no need to copy data from tokens into the AST: storing a reference 
to the token object would be enough (again, 8 bytes instead of 32). 
The same applies to passing tokens into methods - no need to pass by 
ref to minimise overhead
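
To make the subclassing point concrete, here is a rough sketch 
building on the class above. IntegerLiteralToken and its value field 
are hypothetical names for illustration, not part of DCT:

// Hypothetical subclass storing a strongly typed literal value
class IntegerLiteralToken : Token
{
     ulong value; // numeric value decoded during lexing

     this(size_t startIndex, string spelling, TokenKind kind,
          ubyte annotations, ulong value)
     {
         super(startIndex, spelling, kind, annotations);
         this.value = value;
     }
}

A lexer would allocate such tokens with new, and the parser or AST 
would keep the 8-byte references instead of copying token data.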

The change would incur some additional memory overhead (at least 8 
bytes per token), but that is hardly significant. There is also a 
price for accessing token members through an extra indirection, and 
possibly worse cache friendliness, since token instances may be 
allocated anywhere in memory rather than next to each other.

These considerations are mostly about performance. I think there is 
also some impact on design, but I couldn't find anything significant 
(given that I currently see a token as merely a data structure 
without associated behavior).

Could anybody suggest other pros and cons? Which option would you 
choose?

