std.d.lexer requirements

Sat Aug 4 03:04:53 PDT 2012

On 04-Aug-12 14:02, Christophe Travert wrote:
> Jonathan M Davis, in message (digitalmars.D:174191), wrote:
>> On Thursday, August 02, 2012 11:08:23 Walter Bright wrote:
>>> The tokens are not kept, correct. But the identifier strings, and the string
>>> literals, are kept, and if they are slices into the input buffer, then
>>> everything I said applies.
>>
>> String literals often _can't_ be slices unless you leave them in their
>> original state rather than giving the version that they translate to (e.g.
>> leaving \&copy; in the string rather than replacing it with its actual,
>> unicode value). And since you're not going to be able to create the literal
>> using whatever type the range is unless it's a string of some variety, that
>> means that the literals often can't be slices, which - depending on the
>> implementation - would make it so that they can't _ever_ be slices.
>>
>> Identifiers are a different story, since they don't have to be translated at
>> all, but regardless of whether keeping a slice would be better than creating a
>> new string, the identifier table will be far superior, since then you only need
>> one copy of each identifier. So, it ultimately doesn't make sense to use slices
>> in either case even without considering issues like them being spread across
>> memory.
>>
>> The only place that I'd expect a slice in a token is in the string which
>> represents the text which was lexed, and that won't normally be kept around.
>>
>> - Jonathan M Davis
>
> I thought it was not the lexer's job to process literals. Just split
> the input into tokens, and provide minimal info: TokenType, line and col
> along with the representation from the input. That's enough for a syntax
> highlighting tool, for example. Otherwise you'll end up doing complex
> interpretation and the lexer will not be that efficient. Literal
> interpretation can be done in a second step. Do you think doing literal
> interpretation separately when you need it would be less efficient?
>
Most likely - since you re-read the same memory twice to do it.
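
To make the "read twice" point concrete, here is a rough sketch in Python (not D, and not any proposed std.d.lexer API). All names are hypothetical and only a tiny escape subset is handled. The one-pass version decodes escapes while it looks for the closing quote; the two-step version first finds the extent of the literal and then has to walk the same characters again to produce the value.

```python
# Illustrative only: hypothetical helpers, tiny escape subset.
ESCAPES = {"n": "\n", "t": "\t", "\\": "\\", '"': '"'}

def lex_string_one_pass(src, start):
    """Scan a double-quoted literal at src[start], decoding as we go.

    Returns (decoded_value, index_after_closing_quote).
    """
    assert src[start] == '"'
    out = []
    i = start + 1
    while i < len(src):
        c = src[i]
        if c == '"':
            return "".join(out), i + 1
        if c == "\\":
            i += 1
            if i >= len(src):
                break
            # Unknown escapes are passed through unchanged in this sketch.
            out.append(ESCAPES.get(src[i], src[i]))
        else:
            out.append(c)
        i += 1
    raise ValueError("unterminated string literal")

def lex_string_raw(src, start):
    """Step 1 of the two-step design: only find the literal's extent.

    Returns (raw_text_including_quotes, index_after_closing_quote).
    """
    assert src[start] == '"'
    i = start + 1
    while i < len(src):
        if src[i] == "\\":
            i += 2  # skip the escaped character without decoding it
            continue
        if src[i] == '"':
            return src[start:i + 1], i + 1
        i += 1
    raise ValueError("unterminated string literal")

def decode_literal(raw):
    """Step 2: visit the same characters a second time to get the value."""
    value, _ = lex_string_one_pass(raw, 0)
    return value
```

Both routes give the same value, but the second touches every character of the literal twice. Note also that the decoded value differs from the source text as soon as an escape appears, so it cannot be a slice of the input, while the raw text from step 1 can.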

-- 
Dmitry Olshansky