DCT: D compiler as a collection of libraries
Roman D. Boiko
rb at d-coding.com
Fri May 11 02:22:25 PDT 2012
On Friday, 11 May 2012 at 09:08:24 UTC, Jacob Carlborg wrote:
> On 2012-05-11 10:58, Roman D. Boiko wrote:
>> Each token contains:
>> * start index (position in the original encoding, 0 corresponds
>> to the first code unit after BOM),
>> * token value encoded as UTF-8 string,
>> * token kind (e.g., token.kind = TokenKind.Float),
>> * possibly enum with annotations (e.g., token.annotations =
>> FloatAnnotation.Hex | FloatAnnotation.Real)
>
> What about line and column information?
Indices of the first code unit of each line are stored inside the
lexer, and a function will compute the Location (line number,
column number, file specification) for any index. This way the
size of a Token instance is reduced to the minimum. It is assumed
that a Location can be computed on demand and is not needed
frequently, so the column is calculated by walking back to the
previous end of line, etc. It will be possible to calculate
Locations either taking special token sequences (e.g.,
#line 3 "ab/c.d") into account or discarding them.
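
To illustrate, here is a minimal sketch with hypothetical names
(the actual DCT API differs, and the real implementation may walk
the source backwards rather than subtract the stored line start):

struct Location
{
    string file;
    size_t line;   // 1-based
    size_t column; // 1-based, counted in code units
}

struct LineIndex
{
    string file;
    size_t[] lineStarts; // first code unit of each line, ascending

    // Compute a Location on demand for an arbitrary index.
    Location locate(size_t index) const
    {
        // Binary search for the last line start <= index.
        size_t lo = 0, hi = lineStarts.length;
        while (hi - lo > 1)
        {
            const mid = lo + (hi - lo) / 2;
            if (lineStarts[mid] <= index) lo = mid;
            else hi = mid;
        }
        return Location(file, lo + 1, index - lineStarts[lo] + 1);
    }
}

unittest
{
    auto idx = LineIndex("ab/c.d", [0, 10, 25]);
    assert(idx.locate(12) == Location("ab/c.d", 2, 3));
}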
>>> * Does it convert numerical literals and similar to their
>>> actual values
>> It is planned to add a post-processor for that as part of
>> parser,
>> please see README.md for some more details.
>
> Isn't that a job for the lexer?
That might be done in the lexer for efficiency reasons (to avoid
lexing the token value again). But separating this into a
dedicated post-processing phase leads to a much cleaner design
(IMO), and it also suits uses where such values are not needed.
I also don't think performance would improve, given the ratio of
literals to the total number of tokens and the need to store
additional information per token if it were done in the lexer. I
will elaborate on that later.
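
For example, such a post-processing step could look roughly like
this (a sketch only, assuming hypothetical Token/TokenKind
definitions; DCT's actual types, and the handling of suffixes,
underscores and hex floats, are more involved):

import std.conv : to;
import std.math : isClose;

enum TokenKind { Identifier, Integer, Float }

struct Token
{
    size_t start;  // index of the first code unit
    string value;  // token text as UTF-8
    TokenKind kind;
}

// Conversion lives outside the lexer and is invoked only for
// the tokens whose values a client actually needs.
double floatValue(const Token token)
{
    assert(token.kind == TokenKind.Float);
    return token.value.to!double;
}

unittest
{
    auto tok = Token(0, "3.14", TokenKind.Float);
    assert(isClose(floatValue(tok), 3.14));
}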