std.d.lexer requirements

deadalnix deadalnix at gmail.com
Fri Aug 3 06:18:33 PDT 2012


On 02/08/2012 20:08, Walter Bright wrote:
> On 8/2/2012 4:52 AM, deadalnix wrote:
>> On 02/08/2012 09:30, Walter Bright wrote:
>>> On 8/1/2012 11:49 PM, Jacob Carlborg wrote:
>>>> On 2012-08-02 02:10, Walter Bright wrote:
>>>>
>>>>> 1. It should accept as input an input range of UTF8. I feel it is a
>>>>> mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed
>>>>> it UTF16 should use an 'adapter' range to convert the input to UTF8.
>>>>> (This is what component programming is all about.)
>>>>
>>>> I'm no expert on ranges but won't that prevent slicing? Slicing is one
>>>> of the main reasons why the Tango XML parser is so amazingly fast.
>>>>
>>>
>>> You don't want to use slicing on the lexer. The reason is that your
>>> slices will be spread all over memory, as source files can be huge, and
>>> all that memory will be retained and never released. What you want is a
>>> compact representation after lexing. Compactness also helps a lot with
>>> memory caching.
>>>
>>
>> Tokens are not kept in memory. You usually consume them for other
>> processing and throw them away.
>>
>> It isn't an issue.
>
> The tokens are not kept, correct. But the identifier strings, and the
> string literals, are kept, and if they are slices into the input buffer,
> then everything I said applies.
>

OK, what do you think of this:

The lexer can take a parameter that tells it whether to build a table of
tokens or to slice the input. The second option matters, for instance, for
an IDE: lexing happens often, and slicing is preferable there because the
source file is already in memory anyway.

A token always contains a slice as a member. The slice comes either from
the source or from a memory chunk allocated by the lexer.
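
To make the two modes concrete, here is a minimal sketch in D. The names StringStorage, Token and makeToken are hypothetical, chosen only for illustration, and the compile-time parameter is just one possible way to expose the choice; a runtime flag would work too.

```d
// Hypothetical names, for illustration only.
enum StringStorage { sliceInput, copyToLexerMemory }

struct Token
{
    // Always a slice: either into the source or into lexer-owned memory.
    string text;
}

// Build a token for source[start .. stop] using the chosen storage policy.
Token makeToken(StringStorage storage)(string source, size_t start, size_t stop)
{
    static if (storage == StringStorage.sliceInput)
        return Token(source[start .. stop]);      // no allocation, aliases the source
    else
        return Token(source[start .. stop].idup); // copy owned by the lexer
}

unittest
{
    string source = "int foo = bar;";
    auto sliced = makeToken!(StringStorage.sliceInput)(source, 4, 7);
    auto copied = makeToken!(StringStorage.copyToLexerMemory)(source, 4, 7);

    assert(sliced.text == "foo" && copied.text == "foo");
    assert(sliced.text.ptr == source.ptr + 4);  // points into the source buffer
    assert(copied.text.ptr != sliced.text.ptr); // lives in separate memory
}
```

Either way the consumer sees the same Token type, so the parser does not need to know which mode was used. The unittest block can be run with `dmd -unittest -main -run`.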

If the lexer allocates chunks, it will reuse the same memory location for
the same string. Given the following mechanism for comparing slices,
identifiers lexed with that method will need only 2 comparisons:

if (a.length != b.length) return false;
if (a.ptr == b.ptr) return true;
// Otherwise, fall back to a regular char-by-char comparison.
return a == b;
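
Put together, the idea above could look roughly like the following sketch. Interner and sameString are hypothetical names, and the associative array is only an assumption about how the lexer might find the existing copy of a string; any other lookup structure would do.

```d
// Hypothetical helper: hands out one canonical copy per distinct string.
struct Interner
{
    private string[string] table; // content -> canonical copy

    // Return the canonical slice for this content, copying it on first sight.
    string intern(string s)
    {
        if (auto p = s in table)
            return *p;
        string owned = s.idup; // lexer-owned memory, independent of the source
        table[owned] = owned;
        return owned;
    }
}

// The comparison described above: length, then pointer, then contents.
bool sameString(const(char)[] a, const(char)[] b)
{
    if (a.length != b.length) return false; // different lengths: not equal
    if (a.ptr == b.ptr) return true;        // same memory and same length: equal
    return a == b;                          // regular char-by-char comparison
}

unittest
{
    Interner interner;
    string source = "foo + foo";
    string first = interner.intern(source[0 .. 3]);
    string second = interner.intern(source[6 .. 9]);

    assert(first.ptr == second.ptr);   // same identifier, same storage
    assert(sameString(first, second)); // settled by the pointer check
    assert(sameString(source[0 .. 3], source[6 .. 9])); // slices: content check
    assert(!sameString(first, "fo"));  // settled by the length check
}
```

With interned identifiers, equal strings are settled by the first two checks. Slices into the source still compare correctly, they just take the slow path when they are equal but come from different positions.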

Is that a suitable option?

