std.d.lexer requirements

Jonathan M Davis jmdavisProg at gmx.com
Thu Aug 2 00:21:34 PDT 2012


On Wednesday, August 01, 2012 23:39:42 Walter Bright wrote:
> >> Somebody has to convert the input files into dchars, and then back into
> >> chars. That blows for performance. Think billions and billions of
> >> characters going through, not just a few random strings.
> > 
> > Why is there any converting to dchar going on here?
> 
> Because your input range is a range of dchar?

I think that we're misunderstanding each other here. A typical, well-written
range-based function which operates on ranges of dchar will use static if or
overloads to special-case strings. This means that it will function with any 
range of dchar, but it _also_ will be as efficient with strings as if it just 
operated on strings. It won't decode anything in the string unless it has to. 
So, having a lexer which operates on ranges of dchar does _not_ make string 
processing less efficient. It just makes it so that it can _also_ operate on 
ranges of dchar which aren't strings.
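
For concreteness, here is a minimal, self-contained sketch of that pattern
(the function name and template constraint are mine, not taken from the
actual lexer):

import std.range : ElementType, isInputRange;
import std.traits : Unqual, isNarrowString;

bool startsWithSpace(R)(R range)
    if(isNarrowString!R ||
       (isInputRange!R && is(Unqual!(ElementType!R) == dchar)))
{
    static if(isNarrowString!R)
        return range.length != 0 && range[0] == ' '; // code unit; no decoding
    else
        return !range.empty && range.front == ' ';   // generic range of dchar
}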

For instance, my lexer uses this whenever it needs to get at the first 
character in the range:

static if(isNarrowString!R)
    // string/wstring: use the first code unit directly; no decoding
    Unqual!(ElementEncodingType!R) first = range[0];
else
    // any other range of dchar: front is already a full code point
    dchar first = range.front;

I use string mixins to reduce the code duplication that goes with this, but 
notice that strings and wstrings do _not_ get decoded. Their first code unit is 
used, and then almost everything operates on that code unit. e.g.

switch(first)
{
    case '\n': //...
    case ' ': //...
    case '+': //...
    //...
    default: //...
}

It's only when the code actually needs to decode the first code point that it 
does so (e.g. when all of the possible ASCII characters have been exhausted, 
it needs to check whether the first code point is a Unicode alpha character,
since that could be the beginning of an identifier). At that point, I end up 
doing something like

static if(isNarrowString!R)
    dchar front = range.front; // decodes the first code point
else
    alias first front;         // first is already a dchar; no work needed

or, if I need to know the number of code units that make up the code point, I
explicitly call decode in the case of a narrow string. In either case, code
units are _not_ converted to dchar unless they absolutely have to be.
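
As a sketch of that decode path (peekDecoded is a hypothetical helper of
mine; std.utf.decode is the actual Phobos function):

import std.utf : decode;

// Decode the front code point of a narrow string and report how many
// code units it occupied, without consuming them from the range.
void peekDecoded(string str, out dchar c, out size_t codeUnits)
{
    size_t index = 0;
    c = decode(str, index); // decodes one code point, advancing index
    codeUnits = index;      // the number of code units it took up
}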

The only thing that suffers here performance-wise is ranges that aren't 
actually strings. For instance, filter!"true"(source) would have _much_ worse 
performance, because it has to decode every character. But without the concept 
of a variable-length encoded range, we can't really get around that.
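
To make that cost concrete, here is a small illustration of my own (not code
from the lexer):

import std.algorithm : filter;

void example()
{
    string source = "int x;";
    char c = source[0];  // direct code-unit access; nothing is decoded

    // filter must go through the range primitives, which decode UTF-8
    // into dchar one code point at a time.
    auto r = filter!"true"(source);
    dchar d = r.front;   // decoded
}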

> > My point is that it's the sort of thing that _only_ a parser would care
> > about. So, unless it _needs_ to be in the lexer for some reason, it
> > shouldn't be.
> I think you are misunderstanding. The lexer doesn't have a *symbol* table in
> it. It has a mapping from identifiers to unique handles. It needs to be
> there, otherwise the semantic analysis has to scan identifier strings a
> second time.

Yes. I understand. It has a mapping from identifiers to unique handles. My
point is that nothing but parsers will need that. From the standpoint of functionality,
it's a parser feature, not a lexer feature. So, if it can be done just fine in 
the parser, then that's where it should be. If on the other hand, it _needs_ 
to be in the lexer for some reason (e.g. performance), then that's a reason to 
put it there.
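
For readers unfamiliar with the technique, that mapping is essentially string
interning. A rough sketch of the idea (the struct and its AA-based
representation are my assumptions, not dmd's actual implementation):

struct IdentifierPool
{
    private uint[string] handles;

    uint intern(string ident)
    {
        if(auto p = ident in handles)
            return *p;                       // seen before: reuse its handle
        immutable h = cast(uint)handles.length;
        handles[ident] = h;                  // first occurrence: new handle
        return h;
    }
}

With handles like these, later passes compare integers instead of re-scanning
identifier strings.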

It sounds like you're arguing that performance requirements dictate that it be
put in the lexer even if nothing but the parser cares about it. In that case,
the lexer is indeed where it should be. But if performance isn't affected, then
I'd still argue that it should be in the parser.

- Jonathan M Davis

