std.d.lexer requirements

Christophe Travert travert at phare.normalesup.org
Fri Aug 3 07:49:55 PDT 2012


deadalnix, in message (digitalmars.D:174155), wrote:
>> The tokens are not kept, correct. But the identifier strings, and the
>> string literals, are kept, and if they are slices into the input buffer,
>> then everything I said applies.
>>
> 
> Ok, what do you think of that :
> 
> lexer can have a parameter that tells it whether to build a table of 
> tokens or slice the input. The second is important, for instance, for 
> an IDE: lexing will occur often, and you prefer slicing there because 
> you already have the source file in memory anyway.
> 
> The token always contains a slice as a member. The slice comes either 
> from the source or from a memory chunk allocated by the lexer.

If I may add, there are several possibilities here:
 1- a real slice of the input range
 2- a slice of the input range created with .save and takeExactly
 3- a slice allocated in GC memory by the lexer
 4- a slice of memory owned by the lexer, which is reused for the next 
token (thus, the next call to popFront invalidates the token).
 5- a slice of memory from a lookup table.

All are useful in certain situations.
#1 is usable for sliceable ranges, and is definitely efficient when you 
don't have a huge amount of code to parse.
#2 is usable for forward ranges.
#3 is usable for any range, but I would not recommend it...
#4 is usable for any range.
#5 is best if you perform complicated operations with the tokens.
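Whatever the strategy, the token can look the same from the outside; 
only the ownership of its slice differs. A minimal hypothetical shape 
(the names are illustrative, not std.d.lexer's actual definition):

```d
enum TokenKind { identifier, stringLiteral, number, operator, other }

// hypothetical token layout: the payload is always a slice; whether it
// points into the input (#1/#2), GC memory (#3), a reused buffer (#4),
// or a look-up table (#5) is invisible here
struct Token
{
    TokenKind kind;
    const(char)[] text;   // input slice, lexer buffer, or table entry
}
```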

#1/#2 should not be very hard to code: when you start to lex a new 
token, you save the range, and when you find the end of the token, you 
just use takeExactly on the saved range.
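A sketch of #1/#2 along those lines (lexIdentifier is an illustrative 
name, not std.d.lexer's API):

```d
import std.ascii : isAlphaNum;
import std.range : empty, front, popFront, save, takeExactly;

// save the range where the token starts, count elements until the
// token ends, then takeExactly on the saved copy
auto lexIdentifier(R)(ref R input)
{
    auto start = input.save;   // #2: any forward range will do
    size_t len;
    while (!input.empty && isAlphaNum(input.front))
    {
        input.popFront();
        ++len;
    }
    // for sliceable ranges this degenerates to a real slice (#1)
    return takeExactly(start, len);
}
```

The caller's range is advanced past the token, and the returned range 
covers exactly the token's text without any allocation.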

#4 requires an internal buffer. That's more code, but it has to be 
done in a second step anyway if you want to be able to use input 
ranges (which you have to). Actually, the buffer may be external, if 
you use a buffered-range adapter to make a forward range out of an 
input range. Having an internal buffer may be more efficient; that's 
something that has to be profiled.
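A sketch of #4 with an internal, reused buffer (ASCII-only, with 
made-up names; a real lexer would recognize actual token kinds rather 
than split on whitespace):

```d
import std.ascii : isWhite;
import std.range : empty, front, popFront;

// the lexer owns one scratch buffer; each token's text is a slice of
// it, so the next readToken overwrites (invalidates) the previous one
struct BufferedLexer(R)
{
    R input;
    char[] buf;       // reused scratch buffer owned by the lexer
    char[] current;   // current token text; valid until next readToken

    void readToken()
    {
        size_t len;
        while (!input.empty && !isWhite(input.front))
        {
            if (len == buf.length)
                buf.length = buf.length * 2 + 16;   // grow as needed
            buf[len++] = cast(char) input.front;    // ASCII assumption
            input.popFront();
        }
        while (!input.empty && isWhite(input.front))
            input.popFront();                       // skip separators
        current = buf[0 .. len];
    }
}
```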

#3 can be obtained from #4 by map!(x => x.dup).
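Concretely, that amounts to this (a toy stand-in for the lexer's 
reused buffer; the names are made up):

```d
import std.algorithm.iteration : map;
import std.array : array;

void main()
{
    // stand-in for #4 output: slices that all alias one reused buffer
    char[] scratch = "abc".dup;
    char[][] views = [scratch, scratch];

    // #3 from #4: x.dup copies each slice into fresh GC memory
    auto owned = views.map!(x => x.dup).array;

    scratch[0] = 'X';            // the lexer reuses its buffer...
    assert(views[0] == "Xbc");   // ...so un-duplicated views change,
    assert(owned[0] == "abc");   // but duplicated tokens are stable
}
```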

#5 requires one of the previous strategies to be implemented: you need 
to have a slice saved somewhere before consulting the look-up table. 
Therefore, I think #5 can be obtained, without a significant loss of 
efficiency, by an algorithm external to the lexer. This would open up 
many ways to use the lexer. For example, you can filter out tokens 
that you don't want before building the table, which avoids an 
overfull look-up table if you are only interested in a subset of 
tokens.
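Such an external interning step could look like this (intern and the 
table are hypothetical names, not part of any proposed API):

```d
string[string] table;   // look-up table: one copy per distinct text

// return the canonical copy of s, adding it on first sight; callers
// get stable GC strings even if s points into a transient buffer
string intern(const(char)[] s)
{
    if (auto p = s in table)
        return *p;          // seen before: share the stored copy
    auto owned = s.idup;    // first occurrence: copy exactly once
    table[owned] = owned;
    return owned;
}
```

A filtered pipeline would then be something like 
tokens.filter!(...).map!(t => intern(t.text)), so only the tokens you 
keep ever reach the table (t.text being the hypothetical slice member 
from above).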

#1/#2 with adapter ranges might be the only thing that needs to be 
coded, although the API should allow #4 and #5 to be defined, so that 
the user can use the adapters blindly, or in case an internal 
implementation proves to be significantly more efficient.

-- 
Christophe

