std.d.lexer requirements

Dmitry Olshansky dmitry.olsh at gmail.com
Thu Aug 2 13:52:10 PDT 2012


On 02-Aug-12 08:30, Walter Bright wrote:
> On 8/1/2012 8:04 PM, Jonathan M Davis wrote:
>> On Wednesday, August 01, 2012 17:10:07 Walter Bright wrote:
>>> 1. It should accept as input an input range of UTF8. I feel it is a
>>> mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed
>>> it UTF16 should use an 'adapter' range to convert the input to UTF8.
>>> (This is what component programming is all about.)
>>
>> But that's not how ranges of characters work. They're ranges of dchar.
>> Ranges don't operate on UTF-8 or UTF-16. They operate on UTF-32. You'd
>> have to create special wrappers around string or wstring to have ranges
>> of UTF-8. The way that it's normally done is to have ranges of dchar and
>> then special-case range-based functions for strings. Then the function
>> can operate on any range of dchar but still operates on strings
>> efficiently.
>
> I have argued against making ranges of characters dchars, because of
> performance reasons. This will especially adversely affect the
> performance of the lexer.
>
> The fact is, files are normally in UTF8 and just about everything else
> is in UTF8. Prematurely converting to UTF-32 is a performance disaster.
> Note that the single largest thing holding back regex performance is
> that premature conversion to dchar and back to char.

Well, it doesn't convert back to UTF-8, since it just takes slices of the input :)

Otherwise very true, especially with ctRegex, which used to receive quite
some hype even in its present state: 33% of its time is spent doing and
redoing UTF-8 decoding.
(Note that regex does quite a bit of extra work on top of what a lexer
does, e.g. a lexer is largely deterministic, whereas regex does some
try-and-rollback.)

> If lexer is required to accept dchar ranges, its performance will drop
> at least in half, and people are going to go reinvent their own lexers.
>

Yes, it slows things down. Decoding (if any) should kick in only where
it's absolutely necessary, and it should be an integral part of the
lexer's automaton.
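
To illustrate "decode only where necessary", here is a rough sketch in D
of an identifier scanner that walks UTF-8 code units directly and calls
the decoder only when it meets a non-ASCII lead byte. The function name
identifierLength and the simplified identifier rule (ASCII letters,
digits after the first position, '_', plus anything std.uni.isAlpha
accepts) are assumptions made for this example, not part of any proposed
std.d.lexer API. Note that std.utf.decode throws on malformed input.

```d
import std.uni : isAlpha;
import std.utf : decode;

// Hypothetical helper: length, in UTF-8 code units, of the identifier
// that starts at src[0], or 0 if src does not start with one.
size_t identifierLength(const(char)[] src)
{
    size_t i = 0;
    while (i < src.length)
    {
        immutable char c = src[i];
        if (c < 0x80)
        {
            // ASCII fast path: classify the code unit, no decoding.
            immutable bool ok =
                (c >= 'a' && c <= 'z') ||
                (c >= 'A' && c <= 'Z') ||
                c == '_' ||
                (i > 0 && c >= '0' && c <= '9');
            if (!ok)
                break;
            ++i;
        }
        else
        {
            // Slow path: decode a single code point, only here.
            size_t next = i;
            immutable dchar d = decode(src, next); // advances next
            if (!isAlpha(d))
                break;
            i = next;
        }
    }
    return i;
}

unittest
{
    // The token is a slice of the input; nothing is re-encoded.
    auto src = "größe = 1";
    auto n = identifierLength(src);
    assert(src[0 .. n] == "größe");
}
```

For mostly-ASCII source text the decoder is rarely reached, and the
resulting token is just a slice of the original UTF-8 buffer.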


-- 
Dmitry Olshansky

