std.d.lexer requirements

Walter Bright newshound2 at digitalmars.com
Wed Aug 1 22:33:12 PDT 2012


On 8/1/2012 9:52 PM, Jonathan M Davis wrote:
> 1. The current design of Phobos is to have ranges of dchar, because it fosters
> code correctness (though it can harm efficiency). It's arguably too late to do
> otherwise. Certainly, doing otherwise now would break a lot of code. If the
> lexer tried to operate on UTF-8 as part of its API rather than operating on
> ranges of dchar and special-casing strings, then it wouldn't fit in with Phobos
> at all.

For performance reasons, the lexer must use char, or it will not be acceptable 
as anything but a toy.
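
To make that concrete, here is a rough sketch (hypothetical signatures, not 
std.d.lexer's actual API) of the two kinds of front end under discussion: one 
that is handed decoded dchars, and one that works directly on the UTF-8 code 
units:

import std.range.primitives : isInputRange, ElementType;

// Decoded front end: by the time the lexer sees a character, somebody has
// already paid for converting the UTF-8 input into dchars.
void lexDecoded(R)(R source)
    if (isInputRange!R && is(ElementType!R : dchar))
{
    // ... tokenize by looking at dchars ...
}

// Byte-level front end: the hot paths branch on plain ASCII code units and
// decode only in the few places where multibyte sequences can matter.
void lexBytes(const(char)[] source)
{
    // ... tokenize by looking at chars ...
}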


> 2. The lexer does _not_ have to have its performance tank by accepting ranges
> of dchar. It's true that the performance will be harmed for ranges which
> _aren't_ strings, but for strings (as would be by far the most common use
> case) it can be very efficient by special-casing them.

Somebody has to convert the input files into dchars, and then back into chars. 
That blows for performance. Think billions and billions of characters going 
through, not just a few random strings.

Always always think of the lexer as having a firehose of data shoved into its 
maw, and it better be thirsty!
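
For illustration, here is a minimal sketch of the string special-casing 
mentioned in the quote above (hypothetical helper name): with an arbitrary 
range of dchar the lexer has to take whatever decoded characters the range 
hands it, but when the input is a string it can scan the raw code units, and 
since everything it tests for here is ASCII, no decoding happens at all on 
that path:

import std.traits : isSomeString;
import std.range.primitives : isInputRange, ElementType, front, popFront, empty;

size_t skipAsciiWhitespace(R)(ref R src)
    if (isInputRange!R && is(ElementType!R : dchar))
{
    size_t skipped;
    static if (isSomeString!R)
    {
        // Fast path: look at code units directly; ' ', '\t', '\r', '\n'
        // are all ASCII, so no UTF-8 decoding is done here.
        size_t i;
        while (i < src.length &&
               (src[i] == ' ' || src[i] == '\t' ||
                src[i] == '\r' || src[i] == '\n'))
            ++i;
        skipped = i;
        src = src[i .. $];
    }
    else
    {
        // Generic path: every .front may involve a decode somewhere upstream.
        while (!src.empty)
        {
            immutable dchar c = src.front;
            if (c != ' ' && c != '\t' && c != '\r' && c != '\n')
                break;
            src.popFront();
            ++skipped;
        }
    }
    return skipped;
}

Called with a string, this never decodes; called with some other dchar range, 
it falls back to the generic loop.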


> And as much as there are potential performance issues with Phobos' choice of
> treating strings as ranges of dchar, if it were to continue to treat them as
> ranges of code units, it's pretty much a guarantee that there would be a _ton_
> of bugs caused by it. Most programmers have absolutely no clue about how
> unicode works and don't really want to know. They want strings to just work.
> Phobos' approach of defaulting to correct but making it possible to make the
> code faster through specializations is very much in line with D's typical
> approach of making things safe by default but allowing the programmer to do
> unsafe things for optimizations when they know what they're doing.

I expect std.d.lexer to handle UTF8 correctly, so I don't think this should be 
an issue in this particular case. dmd's lexer does handle UTF8 correctly.

Note also that the places where non-ASCII characters can appear in correct D 
code are severely limited, there are even fewer places where multibyte 
characters need to be decoded at all, and the lexer takes full advantage of this 
to boost its speed.

For example, non-ASCII characters can appear in comments, but they DO NOT need 
to be decoded, and even just having the test for a non-ASCII character in the 
comment scanner will visibly slow down the lexer.
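
As a hedged sketch of why that works (hypothetical helper, not dmd's actual 
code, and ignoring the rarer Unicode line terminators the D grammar also 
allows): a // comment ends at an ASCII newline, and every code unit of a 
multibyte UTF-8 sequence has its high bit set, so none of them can ever look 
like '\n'. The scanner can simply walk raw bytes:

// Skip a // comment starting at index i (src[i .. i + 2] is assumed to be "//").
// Returns the index of the line terminator, or src.length at end of input.
size_t skipLineComment(const(char)[] src, size_t i)
{
    i += 2;
    while (i < src.length && src[i] != '\n' && src[i] != '\r')
        ++i;            // non-ASCII bytes pass through without being decoded
    return i;
}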

>> All identifiers are entered into a hashtable, and are referred to by
>> pointers into that hashtable for the rest of dmd. This makes symbol lookups
>> incredibly fast, as no string comparisons are done.
>
> Hmmm. Well, I'd still argue that that's a parser thing. Pretty much nothing
> else will care about it. At most, it should be an optional feature of the
> lexer. But it certainly could be added that way.

I hate to say "trust me on this", but if you don't, have a look at dmd's lexer 
and how it handles identifiers, then look at dmd's symbol table.
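
For readers who don't want to dig through dmd, here is a minimal sketch of the 
interning idea (hypothetical types; dmd's own Identifier and string table 
differ in detail): each distinct spelling is stored once, every later 
occurrence of it gets back the same pointer, and from then on the rest of the 
compiler compares identifiers by pointer instead of by string:

import std.stdio;

// One unique node per distinct spelling.
final class Identifier
{
    string name;
    this(string name) { this.name = name; }
}

// Hypothetical interning table keyed by the identifier's spelling.
final class IdentifierTable
{
    private Identifier[string] table;

    Identifier intern(string spelling)
    {
        if (auto p = spelling in table)
            return *p;
        auto id = new Identifier(spelling);
        table[spelling] = id;
        return id;
    }
}

void main()
{
    auto tab = new IdentifierTable;
    auto a = tab.intern("foo");
    auto b = tab.intern("foo");
    writeln(a is b);                   // true: identical pointer, so later
                                       // symbol lookups never compare strings
    writeln(a is tab.intern("bar"));   // false
}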


