std.d.lexer requirements

Wed Aug 1 23:39:42 PDT 2012

On 8/1/2012 10:44 PM, Jonathan M Davis wrote:
> On Wednesday, August 01, 2012 22:33:12 Walter Bright wrote:
>> The lexer must use char or it will not be acceptable as anything but a toy
>> for performance reasons.
>
> Avoiding decoding can be done with strings and operating on ranges of dchar,
> so you'd be operating almost entirely on ASCII. Are you saying that there's a
> performance issue aside from decoding?

1. Decoding it into dchars is a performance problem. Source code sits in files
that are nearly always in UTF-8. So your input range MUST check every single char
and convert it to UTF-32 as necessary. Plus, that puts an additional step between
the file input buffer and the lexer's input, rather than sticking the buffer
directly into the lexer.

2. Storing strings as dchars is a performance and memory problem (for ASCII
text, 4x as much data, and hence more time to move it around).

Remember, nearly ALL of D source will be ASCII. All performance considerations 
must be tilted towards the usual case.
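
A rough sketch of what tilting towards the usual case can look like, written in
C++ and not taken from any existing lexer. The names scanSource, decodeUtf8 and
ScanStats are made up for illustration, and the decoder assumes well-formed
UTF-8 (a real lexer would have to validate). ASCII bytes are consumed straight
from the file buffer, and decoding to a 32-bit code point happens only when a
byte has its high bit set:

```cpp
#include <cstddef>
#include <string>

// Hypothetical counters, only here so the two paths can be observed.
struct ScanStats {
    std::size_t asciiBytes = 0;  // bytes handled on the fast path
    std::size_t decoded = 0;     // multi-byte sequences sent to the slow path
};

// Slow path: decode one multi-byte UTF-8 sequence starting at src[i] and
// advance i past it. Assumes well-formed input; no validation is done.
inline char32_t decodeUtf8(const std::string& src, std::size_t& i) {
    unsigned char b0 = static_cast<unsigned char>(src[i]);
    int extra = (b0 >= 0xF0) ? 3 : (b0 >= 0xE0) ? 2 : 1;  // continuation bytes
    char32_t c = b0 & (0x3F >> extra);                    // payload of lead byte
    for (int k = 1; k <= extra && i + k < src.size(); ++k) {
        unsigned char b = static_cast<unsigned char>(src[i + k]);
        c = (c << 6) | (b & 0x3F);
    }
    i += static_cast<std::size_t>(extra) + 1;
    return c;
}

// Walk the buffer as chars. ASCII never leaves the fast path.
inline ScanStats scanSource(const std::string& src) {
    ScanStats st;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char b = static_cast<unsigned char>(src[i]);
        if (b < 0x80) {
            ++st.asciiBytes;  // usual case: one compare, one increment
            ++i;
        } else {
            (void)decodeUtf8(src, i);  // rare case: pay for decoding here only
            ++st.decoded;
        }
    }
    return st;
}
```

For a buffer that is pure ASCII, scanSource never calls decodeUtf8 at all, which
is the point: the cost of Unicode is paid only where Unicode actually appears.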

>> Somebody has to convert the input files into dchars, and then back into
>> chars. That blows for performance. Think billions and billions of
>> characters going through, not just a few random strings.
>
> Why is there any converting to dchar going on here?

Because your input range is a range of dchar?

> I don't see why any would
> be necessary. If you're reading in a file as a string or char[] (as would be
> typical), then you're operating on a string, and then the only time that any
> decoding will be necessary is when you actually need to operate on a unicode
> character, which is very rare in D's grammar. It's only when operating on
> something _other_ than a string that you'd have to actually deal with dchars.

That's what I've been saying. So why have an input range of dchars, which must 
be decoded in advance, otherwise it wouldn't be a range of dchars?

>>> Hmmm. Well, I'd still argue that that's a parser thing. Pretty much
>>> nothing
>>> else will care about it. At most, it should be an optional feature of the
>>> lexer. But it certainly could be added that way.
>>
>> I hate to say "trust me on this", but if you don't, have a look at dmd's
>> lexer and how it handles identifiers, then look at dmd's symbol table.
>
> My point is that it's the sort of thing that _only_ a parser would care about.
> So, unless it _needs_ to be in the lexer for some reason, it shouldn't be.

I think you are misunderstanding. The lexer doesn't have a *symbol* table in it. 
It has a mapping from identifiers to unique handles. It needs to be there, 
otherwise the semantic analysis has to scan identifier strings a second time.
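
A minimal sketch of such a mapping, in C++. This is not dmd's actual code; the
class name IdentifierTable, the integer Handle type and the member names are
all assumptions made for illustration. Each distinct spelling is hashed once,
when the lexer first sees it, and every later occurrence yields the same handle:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical identifier table: one small integer handle per distinct spelling.
class IdentifierTable {
public:
    using Handle = std::size_t;

    // Return the handle for name, creating a new one on first sight.
    Handle intern(const std::string& name) {
        auto it = handles_.find(name);
        if (it != handles_.end())
            return it->second;       // seen before: same handle as last time
        Handle h = names_.size();    // handles are assigned in order 0, 1, 2, ...
        names_.push_back(name);
        handles_.emplace(name, h);
        return h;
    }

    // Recover the spelling, e.g. for error messages.
    const std::string& spelling(Handle h) const { return names_.at(h); }

    std::size_t size() const { return names_.size(); }

private:
    std::unordered_map<std::string, Handle> handles_;
    std::vector<std::string> names_;
};
```

With something like this in the lexer, later passes can compare two identifiers
by comparing their handles, without touching the identifier strings again.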