std.d.lexer requirements
Jonathan M Davis
jmdavisProg at gmx.com
Wed Aug 1 21:52:15 PDT 2012
On Wednesday, August 01, 2012 21:30:44 Walter Bright wrote:
> On 8/1/2012 8:04 PM, Jonathan M Davis wrote:
> > On Wednesday, August 01, 2012 17:10:07 Walter Bright wrote:
> >> 1. It should accept as input an input range of UTF8. I feel it is a
> >> mistake
> >> to templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16
> >> should use an 'adapter' range to convert the input to UTF8. (This is what
> >> component programming is all about.)
> >
> > But that's not how ranges of characters work. They're ranges of dchar.
> > Ranges don't operate on UTF-8 or UTF-16. They operate on UTF-32. You'd
> > have to create special wrappers around string or wstring to have ranges
> > of UTF-8. The way that it's normally done is to have ranges of dchar and
> > then special-case range-based functions for strings. Then the function
> > can operate on any range of dchar but still operates on strings
> > efficiently.
>
> I have argued against making ranges of characters dchars, because of
> performance reasons. This will especially adversely affect the performance
> of the lexer.
>
> The fact is, files are normally in UTF8 and just about everything else is in
> UTF8. Prematurely converting to UTF-32 is a performance disaster. Note that
> the single largest thing holding back regex performance is the premature
> conversion to dchar and back to char.
>
> If the lexer is required to accept dchar ranges, its performance will drop by
> at least half, and people are going to go reinvent their own lexers.
1. The current design of Phobos is to have ranges of dchar, because it fosters
code correctness (though it can harm efficiency). It's arguably too late to do
otherwise. Certainly, doing otherwise now would break a lot of code. If the
lexer tried to operate on UTF-8 as part of its API rather than operating on
ranges of dchar and special-casing strings, then it wouldn't fit in with Phobos
at all.
2. The lexer does _not_ have to have its performance tank by accepting ranges
of dchar. It's true that the performance will be harmed for ranges which
_aren't_ strings, but for strings (as would be by far the most common use
case) it can be very efficient by special-casing them.
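The special-casing pattern described above can be sketched like this (a
simplified illustration; the helper name countIdentChars is hypothetical, not
part of any proposed lexer API). The generic overload accepts any range of
dchar, while the string overload scans raw code units directly, skipping UTF
decoding:

```d
import std.range.primitives : isInputRange, ElementType, empty, front, popFront;
import std.traits : isSomeString;

// Generic path: works on any input range of dchar.
size_t countIdentChars(R)(R r)
    if (isInputRange!R && is(ElementType!R : dchar) && !isSomeString!R)
{
    size_t n;
    for (; !r.empty; r.popFront())
    {
        immutable dchar c = r.front;
        if (!(c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')))
            break;
        ++n;
    }
    return n;
}

// Fast path: identifier characters are ASCII, so for strings we can
// compare code units directly without decoding to dchar.
size_t countIdentChars(C)(const(C)[] s)
    if (isSomeString!(const(C)[]))
{
    size_t n;
    while (n < s.length && (s[n] == '_'
            || (s[n] >= 'a' && s[n] <= 'z')
            || (s[n] >= 'A' && s[n] <= 'Z')))
        ++n;
    return n;
}
```

Both overloads give the same answer, but the string overload never pays for
UTF-8 decoding, which is exactly what a lexer's hot path needs.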
And as much as there are potential performance issues with Phobos' choice of
treating strings as ranges of dchar, if it were instead to treat them as
ranges of code units, it's pretty much a guarantee that there would be a _ton_
of bugs caused by it. Most programmers have absolutely no clue about how
Unicode works and don't really want to know. They want strings to just work.
Phobos' approach of defaulting to correct but making it possible to make the
code faster through specializations is very much in line with D's typical
approach of making things safe by default but allowing the programmer to do
unsafe things for optimizations when they know what they're doing.
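A small illustration of that correct-by-default behavior: indexing a string
sees raw UTF-8 code units, but the range primitives decode, so range-based
code always sees whole code points:

```d
import std.range.primitives : front;

void main()
{
    string s = "héllo"; // 'é' occupies two UTF-8 code units
    // Indexing and .length operate on raw code units:
    assert(s.length == 6);
    // But the range interface decodes, so front is a full dchar:
    static assert(is(typeof(s.front) == dchar));
    assert(s.front == 'h');
}
```

Code written against the range interface is therefore Unicode-correct without
the programmer doing anything, and only the specialized fast paths need to
think about code units.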
> All identifiers are entered into a hashtable, and are referred to by
> pointers into that hashtable for the rest of dmd. This makes symbol lookups
> incredibly fast, as no string comparisons are done.
Hmmm. Well, I'd still argue that that's a parser thing. Pretty much nothing
else will care about it. At most, it should be an optional feature of the
lexer. But it certainly could be added that way.
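To make concrete what Walter describes, identifier interning can be sketched
roughly like this (a minimal illustration of the technique, not dmd's actual
implementation):

```d
// Each distinct spelling is stored once; every later lookup returns
// the same object, so identity comparison (is) replaces string
// comparison for the rest of the compiler.
class Identifier
{
    string name;
    this(string n) { name = n; }
}

Identifier[string] internTable;

Identifier intern(string spelling)
{
    if (auto p = spelling in internTable)
        return *p;
    auto id = new Identifier(spelling);
    internTable[spelling] = id;
    return id;
}

void main()
{
    auto a = intern("foo");
    auto b = intern("foo");
    auto c = intern("bar");
    assert(a is b);   // same object: symbol lookup is a pointer compare
    assert(a !is c);
}
```

Making this an optional layer on top of the lexer, as suggested above, would
let the parser opt in without burdening other users of the token stream.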
- Jonathan M Davis