DCT: D compiler as a collection of libraries

Roman D. Boiko rb at d-coding.com
Fri May 11 02:22:25 PDT 2012


On Friday, 11 May 2012 at 09:08:24 UTC, Jacob Carlborg wrote:
> On 2012-05-11 10:58, Roman D. Boiko wrote:
>> Each token contains:
>> * start index (position in the original encoding, 0 corresponds
>> to the first code unit after BOM),
>> * token value encoded as UTF-8 string,
>> * token kind (e.g., token.kind = TokenKind.Float),
>> * possibly enum with annotations (e.g., token.annotations =
>> FloatAnnotation.Hex | FloatAnnotation.Real)
>
> What about line and column information?
Indices of the first code unit of each line are stored inside 
the lexer, and a function will compute a Location (line number, 
column number, file specification) for any index. This way the 
size of a Token instance is reduced to a minimum. It is assumed 
that a Location can be computed on demand and is not needed 
frequently, so the column is calculated by a reverse walk to the 
previous end of line, etc. It will be possible to calculate 
locations both taking special token sequences (e.g., 
#line 3 "ab/c.d") into account and discarding them.
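
Roughly how such an on-demand lookup could look (a minimal 
sketch in D with hypothetical names, not DCT's actual code): the 
lexer keeps a sorted array of line-start indices, a binary 
search recovers the line, and the column follows from the 
distance to that line's start (counted in code units here for 
brevity; lineStarts[0] is assumed to be 0):

import std.range : assumeSorted;

struct Location
{
    size_t line;    // 1-based line number
    size_t column;  // 1-based column (code units, for brevity)
    string file;
}

struct LineIndex
{
    size_t[] lineStarts; // index of the first code unit of each line
    string file;

    // Binary-search for the line containing `index`, then derive
    // the column from the distance to that line's start.
    Location locate(size_t index) const
    {
        auto lessOrEqual = lineStarts.assumeSorted.lowerBound(index + 1);
        size_t line = lessOrEqual.length;              // 1-based
        size_t column = index - lineStarts[line - 1] + 1;
        return Location(line, column, file);
    }
}

unittest
{
    auto idx = LineIndex([0, 12, 30], "ab/c.d");
    assert(idx.locate(15) == Location(2, 4, "ab/c.d"));
}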

>>> * Does it convert numerical literals and similar to their 
>>> actual values
>> It is planned to add a post-processor for that as part of 
>> parser,
>> please see README.md for some more details.
>
> Isn't that a job for the lexer?
That might be done in the lexer for efficiency reasons (to avoid 
lexing the token value again). But separating this into a 
dedicated post-processing phase leads to a much cleaner design 
(IMO), and is also suitable for uses where such values are not 
needed. Also, I don't think performance would actually improve, 
given the ratio of the number of literals to the total number of 
tokens and the need to store additional information per token if 
it were done in the lexer. I will elaborate on that later.
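
Roughly what I have in mind (a hypothetical sketch, not DCT's 
actual API): a separate pass converts a literal token's text to 
its value only when asked, so nothing extra is stored per token:

import std.conv : to;

enum TokenKind { Integer, Float, Other }

struct Token
{
    size_t start;
    string value;      // token text, UTF-8
    TokenKind kind;
}

// Convert a numeric literal to its value on demand. A real
// implementation would also handle suffixes (u, L, f),
// underscores, hex/binary literals, etc.
double literalValue(const Token tok)
{
    final switch (tok.kind)
    {
        case TokenKind.Integer:
            return tok.value.to!long;
        case TokenKind.Float:
            return tok.value.to!double;
        case TokenKind.Other:
            assert(0, "not a numeric literal");
    }
}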
