std.d.lexer requirements
Jonathan M Davis
jmdavisProg at gmx.com
Wed Aug 1 21:52:15 PDT 2012
On Wednesday, August 01, 2012 21:30:44 Walter Bright wrote:
> On 8/1/2012 8:04 PM, Jonathan M Davis wrote:
> > On Wednesday, August 01, 2012 17:10:07 Walter Bright wrote:
> >> 1. It should accept as input an input range of UTF8. I feel it is a
> >> mistake
> >> to templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16
> >> should use an 'adapter' range to convert the input to UTF8. (This is what
> >> component programming is all about.)
> >
> > But that's not how ranges of characters work. They're ranges of dchar.
> > Ranges don't operate on UTF-8 or UTF-16. They operate on UTF-32. You'd
> > have to create special wrappers around string or wstring to have ranges
> > of UTF-8. The way that it's normally done is to have ranges of dchar and
> > then special-case range-based functions for strings. Then the function
> > can operate on any range of dchar but still operates on strings
> > efficiently.
>
> I have argued against making ranges of characters dchars, because of
> performance reasons. This will especially adversely affect the performance
> of the lexer.
>
> The fact is, files are normally in UTF8 and just about everything else is in
> UTF8. Prematurely converting to UTF-32 is a performance disaster. Note that
> the single largest thing holding back regex performance is the premature
> conversion to dchar and back to char.
>
> If the lexer is required to accept dchar ranges, its performance will drop by
> at least half, and people are going to go reinvent their own lexers.
1. The current design of Phobos is to have ranges of dchar, because it fosters
code correctness (though it can harm efficiency). It's arguably too late to do
otherwise. Certainly, doing otherwise now would break a lot of code. If the
lexer tried to operate on UTF-8 as part of its API rather than operating on
ranges of dchar and special-casing strings, then it wouldn't fit in with Phobos
at all.
2. The lexer does _not_ have to have its performance tank by accepting ranges
of dchar. It's true that the performance will be harmed for ranges which
_aren't_ strings, but for strings (as would be by far the most common use
case) it can be very efficient by special-casing them.
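The special-casing pattern described above can be sketched like this (a
simplified illustration; the helper name countIdentChars is hypothetical, not
part of any proposed lexer API). The generic overload accepts any range of
dchar, while the string overload scans raw code units directly, skipping UTF
decoding:

```d
import std.range.primitives : isInputRange, ElementType, empty, front, popFront;
import std.traits : isSomeString;

// Generic path: works on any input range of dchar.
size_t countIdentChars(R)(R r)
    if (isInputRange!R && is(ElementType!R : dchar) && !isSomeString!R)
{
    size_t n;
    for (; !r.empty; r.popFront())
    {
        immutable dchar c = r.front;
        if (!(c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')))
            break;
        ++n;
    }
    return n;
}

// Fast path: identifier characters are ASCII, so for strings we can
// compare code units directly without decoding to dchar.
size_t countIdentChars(C)(const(C)[] s)
    if (isSomeString!(const(C)[]))
{
    size_t n;
    while (n < s.length && (s[n] == '_'
            || (s[n] >= 'a' && s[n] <= 'z')
            || (s[n] >= 'A' && s[n] <= 'Z')))
        ++n;
    return n;
}
```

Both overloads give the same answer, but the string overload never pays for
UTF-8 decoding, which is exactly what a lexer's hot path needs.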
And as much as there are potential performance issues with Phobos' choice of
treating strings as ranges of dchar, if it were instead to treat them as
ranges of code units, it's pretty much a guarantee that there would be a _ton_
of bugs caused by it. Most programmers have absolutely no clue about how
Unicode works and don't really want to know. They want strings to just work.
Phobos' approach of defaulting to correct but making it possible to make the
code faster through specializations is very much in line with D's typical
approach of making things safe by default but allowing the programmer to do
unsafe things for optimizations when they know what they're doing.
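A small illustration of that correct-by-default behavior: indexing a string
sees raw UTF-8 code units, but the range primitives decode, so range-based
code always sees whole code points:

```d
import std.range.primitives : front;

void main()
{
    string s = "héllo"; // 'é' occupies two UTF-8 code units
    // Indexing and .length operate on raw code units:
    assert(s.length == 6);
    // But the range interface decodes, so front is a full dchar:
    static assert(is(typeof(s.front) == dchar));
    assert(s.front == 'h');
}
```

Code written against the range interface is therefore Unicode-correct without
the programmer doing anything, and only the specialized fast paths need to
think about code units.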
> All identifiers are entered into a hashtable, and are referred to by
> pointers into that hashtable for the rest of dmd. This makes symbol lookups
> incredibly fast, as no string comparisons are done.
Hmmm. Well, I'd still argue that that's a parser thing. Pretty much nothing
else will care about it. At most, it should be an optional feature of the
lexer. But it certainly could be added that way.
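To make concrete what Walter describes, identifier interning can be sketched
roughly like this (a minimal illustration of the technique, not dmd's actual
implementation):

```d
// Each distinct spelling is stored once; every later lookup returns
// the same object, so identity comparison (is) replaces string
// comparison for the rest of the compiler.
class Identifier
{
    string name;
    this(string n) { name = n; }
}

Identifier[string] internTable;

Identifier intern(string spelling)
{
    if (auto p = spelling in internTable)
        return *p;
    auto id = new Identifier(spelling);
    internTable[spelling] = id;
    return id;
}

void main()
{
    auto a = intern("foo");
    auto b = intern("foo");
    auto c = intern("bar");
    assert(a is b);   // same object: symbol lookup is a pointer compare
    assert(a !is c);
}
```

Making this an optional layer on top of the lexer, as suggested above, would
let the parser opt in without burdening other users of the token stream.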
- Jonathan M Davis