DCT: D compiler as a collection of libraries

deadalnix deadalnix at gmail.com
Mon May 14 09:30:21 PDT 2012


On 14/05/2012 17:00, Roman D. Boiko wrote:
> On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:
>> I think you are wasting much more memory and performance by storing
>> all the tokens in the lexer.
>>
>> Imagine I want to implement a simple syntax highlighter: just
>> highlight keywords. How can I tell DCT *not* to store all the tokens,
>> since I only need each one in turn? And since I'll be highlighting in the
>> editor I will need column and line information. That means I'll have
>> to do that O(log(n)) operation for every token.
>>
>> So you see, for the simplest use case of a lexer the performance of
>> DCT is awful.
>>
>> Now imagine I want to build an AST. Again, I consume the tokens one by
>> one, probably peeking in some cases. If I want to store line and
>> column information I just copy them to the AST. You say the tokens are
>> discarded but their data is not, and that's why their data is usually
>> copied.
>
> Currently I'm thinking about making Token a class instead of a struct.
>
> A token (from
> https://github.com/roman-d-boiko/dct/blob/master/fe/core.d) is:
>
> // Represents a lexed token
> struct Token
> {
>     size_t startIndex;  // position of the first code unit in the source string
>     string spelling;    // characters from which this token has been lexed
>     TokenKind kind;     // enum; each keyword and operator has a dedicated kind
>     ubyte annotations;  // meta information, e.g. whether a token is valid, or
>                         // whether an integer literal is signed, long, hexadecimal, etc.
> }
>
> Making it a class would give several benefits:
>
> * avoid worrying about allocating one big array of tokens. E.g., on a
> 64-bit OS the largest module in Phobos (IIRC, std.datetime) consumes
> 13.5MB as an array of almost 500K tokens. The contiguous chunk of memory
> needed would be four times smaller if it were an array of class references,
> because each element would take only 8 bytes instead of 32.
>
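
For concreteness, a minimal sketch of the sizes the quoted bullet refers to
(it assumes a 64-bit target; the names and field layout are simplified
stand-ins, not DCT's actual definitions):

struct TokenStruct
{
    size_t startIndex; // 8 bytes on 64-bit
    string spelling;   // slice: pointer + length = 16 bytes
    ubyte kind;        // stand-in for TokenKind
    ubyte annotations;
}                      // padded to 32 bytes in total

class TokenClass
{
    size_t startIndex;
    string spelling;
    ubyte kind;
    ubyte annotations;
}

void main()
{
    import std.stdio : writeln;

    writeln(TokenStruct.sizeof);                      // 32: the array stores the full payload per element
    writeln(TokenClass.sizeof);                       // 8: the array stores only a reference per element
    writeln(__traits(classInstanceSize, TokenClass)); // ~48: payload plus object header, allocated separately
}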

Why is this a benefit?

> * allow subclassing, for example, for storing strongly typed literal
> values; this flexibility could also facilitate future extensibility (but
> it's difficult to predict which kind of extension may be needed)
>
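
A quick sketch of what that subclassing could look like (hypothetical names,
nothing that exists in DCT today):

enum TokenKind : ubyte { identifier, keyword, operator, integerLiteral, stringLiteral }

class Token
{
    size_t startIndex;
    string spelling;
    TokenKind kind;
    ubyte annotations;
}

class IntegerLiteralToken : Token
{
    long value;   // strongly typed literal value carried alongside the spelling
}

class StringLiteralToken : Token
{
    string value; // decoded value, e.g. with escape sequences resolved
}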

I'm pretty sure that D's tokens will not change that much. If the need 
isn't identified right now, I'd advocate for YAGNI.

> * there would be no need to copy data from tokens into the AST; passing an
> object would be enough (again, copying 8 bytes instead of 32); the same
> applies to passing tokens into methods - no need to pass by ref to minimise
> overhead
>
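
A sketch of the trade-off being described (the node types are hypothetical,
not DCT's API): with a struct token the AST embeds or copies the 32-byte
payload, while with a class token it stores only the 8-byte reference:

struct TokenS { size_t startIndex; string spelling; }

class TokenC
{
    this(size_t index, string text) { startIndex = index; spelling = text; }
    size_t startIndex;
    string spelling;
}

struct NodeByValue { TokenS token; } // ~32 bytes of token data embedded in every node
struct NodeByRef   { TokenC token; } // an 8-byte reference; the token data lives on the GC heap

void main()
{
    auto byValue = NodeByValue(TokenS(0, "foo"));   // copies the whole payload into the node
    auto byRef   = NodeByRef(new TokenC(0, "foo")); // copies only the reference
    assert(byValue.token.spelling == byRef.token.spelling);
}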

Yes, but now you add pressure on the GC and add indirections. I'm not 
sure it's worth it. It seems to me like a premature optimization.
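
To make that concern concrete, a rough sketch using the ~500K-token figure
quoted above (order of magnitude only):

struct TokenS { size_t startIndex; string spelling; }
class  TokenC { size_t startIndex; string spelling; }

void main()
{
    enum n = 500_000;

    // Struct tokens: a single contiguous allocation that the GC tracks as one block.
    auto structTokens = new TokenS[n];

    // Class tokens: one allocation for the reference array...
    auto classTokens = new TokenC[n];
    foreach (i; 0 .. n)
        classTokens[i] = new TokenC; // ...plus n small allocations, each with its own
                                     // GC bookkeeping and an extra indirection on access
}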

> It would incur some additional memory overhead (at least 8 bytes per
> token), but that's hardly significant. There is also an additional cost for
> accessing token members because of the indirection, and possibly worse
> cache friendliness (token instances may be allocated anywhere in memory,
> not close to each other).
>
> These considerations are mostly about performance. I think there is also
> some impact on design, but I couldn't find anything significant (given
> that currently I see a token as merely a data structure without
> associated behavior).
>
> Could anybody suggest other pros and cons? Which option would you choose?

You are over-engineering the whole thing.

