std.d.lexer requirements

Thu Aug 2 01:38:25 PDT 2012

On 8/2/2012 12:21 AM, Jonathan M Davis wrote:
>> Because your input range is a range of dchar?
> I think that we're misunderstanding each other here. A typical, well-written,
> range-based function which operates on ranges of dchar will use static if or
> overloads to special-case strings. This means that it will function with any
> range of dchar, but it _also_ will be as efficient with strings as if it just
> operated on strings.

It *still* must convert UTF-8 to dchars before presenting them to the consumer of 
the dchar elements.

> It won't decode anything in the string unless it has to.
> So, having a lexer which operates on ranges of dchar does _not_ make string
> processing less efficient. It just makes it so that it can _also_ operate on
> ranges of dchar which aren't strings.
>
> For instance, my lexer uses this whenever it needs to get at the first
> character in the range:
>
> static if(isNarrowString!R)
>      Unqual!(ElementEncodingType!R) first = range[0];
> else
>      dchar first = range.front;

You're requiring a random access input range that provides random access to 
something other than the range's element type? And you're requiring 
isNarrowString to work on an arbitrary range?

> if I need to know the number of code units that make up the code point, I
> explicitly call decode in the case of a narrow string. In either case, code
> units are _not_ being converted to dchar unless they absolutely have to be.

Or you could do away with requiring a special range type and just have it be a 
UTF-8 range.
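
To illustrate what I mean by working on UTF-8 directly, here is a rough sketch, 
not code from any real lexer. The name identifierLength and the simplified 
identifier rule (ASCII letters, digits and '_', plus anything std.uni.isAlpha 
accepts) are assumptions made for the example. The scanner walks code units and 
calls std.utf.decode only when it meets a non-ASCII byte:

```d
import std.utf : decode;
import std.uni : isAlpha;

/// Hypothetical helper: length, in UTF-8 code units, of the identifier
/// at the start of `src`, or 0 if `src` does not start with one.
size_t identifierLength(const(char)[] src)
{
    size_t i = 0;
    while (i < src.length)
    {
        immutable c = src[i];
        if (c < 0x80)
        {
            // ASCII fast path: inspect the code unit, decode nothing.
            immutable ok = c == '_'
                || (c >= 'a' && c <= 'z')
                || (c >= 'A' && c <= 'Z')
                || (i > 0 && c >= '0' && c <= '9');
            if (!ok)
                break;
            ++i;
        }
        else
        {
            // Non-ASCII: decode one code point; `next` is advanced past it.
            size_t next = i;
            immutable dchar d = decode(src, next);
            if (!isAlpha(d))
                break;
            i = next;
        }
    }
    return i;
}

unittest
{
    assert(identifierLength("foo42 = 1") == 5);
    assert(identifierLength("9abc") == 0);
    // U+00EF occupies two code units, so "na\u00EFve" is six units long.
    assert(identifierLength("na\u00EFve+") == 6);
}
```

Something like rdmd -unittest -main should run the checks. The consumer sees 
one element type, char, and decoding happens only where it is needed.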

What I wasn't realizing earlier was that you were positing a range type that has 
two different kinds of elements. I don't think this is a proper component type.
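
A small sketch of the two element kinds, assuming a current Phobos where these 
templates live in std.range.primitives and std.traits. firstElement is a 
hypothetical wrapper around the quoted static if, written only for this 
illustration:

```d
import std.range.primitives : front, ElementType, ElementEncodingType;
import std.traits : isNarrowString, Unqual;

// As a range, a string has dchar elements; as an array, it has char elements.
static assert(is(ElementType!string == dchar));
static assert(is(Unqual!(ElementEncodingType!string) == char));

/// Hypothetical wrapper around the quoted static-if.
auto firstElement(R)(R range)
{
    static if (isNarrowString!R)
        return range[0];     // code unit, obtained by indexing
    else
        return range.front;  // the range's own element type
}

unittest
{
    string s = "\u00E9!";          // U+00E9 is encoded as 0xC3 0xA9 in UTF-8
    assert(s[0] == 0xC3);          // indexing sees the first code unit
    assert(s.front == '\u00E9');   // front decodes the whole code point
    static assert(is(Unqual!(typeof(firstElement(s))) == char));

    dstring d = "\u00E9!"d;
    static assert(is(Unqual!(typeof(firstElement(d))) == dchar));
}
```

The same call hands back a char for one argument type and a dchar for another, 
and range[0] and range.front disagree about what an element is.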

> Yes. I understand. It has a mapping of pointers to identifiers. My point is
> that nothing but parsers will need that.
> From the standpoint of functionality,
> it's a parser feature, not a lexer feature. So, if it can be done just fine in
> the parser, then that's where it should be. If on the other hand, it _needs_
> to be in the lexer for some reason (e.g. performance), then that's a reason to
> put it there.

If you take it out of the lexer, then:

1. the lexer must allocate storage for every identifier, rather than only for 
unique identifiers

2. and then the parser must scan the identifier string *again*

3. there must be two hash lookups of each identifier rather than one

It's a suboptimal design.
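
For concreteness, a minimal sketch of an identifier table owned by the lexer. 
The names Identifier, IdentifierTable and intern are made up for this example 
and are not taken from any existing lexer; a built-in associative array stands 
in for whatever hash table a real implementation would use:

```d
/// Hypothetical identifier record; one is allocated per unique spelling.
final class Identifier
{
    string name;
    this(string name) { this.name = name; }
}

/// Hypothetical table owned by the lexer.
struct IdentifierTable
{
    private Identifier[string] table;

    /// Returns the unique Identifier for `text`, allocating only the
    /// first time a spelling is seen.
    Identifier intern(const(char)[] text)
    {
        if (auto p = text in table)  // a repeated spelling costs one lookup
            return *p;
        auto name = text.idup;       // copy the slice only for a new spelling
        auto id = new Identifier(name);
        table[name] = id;
        return id;
    }

    size_t length() const { return table.length; }
}

unittest
{
    IdentifierTable t;
    auto a = t.intern("foo");
    auto b = t.intern("foo".dup);  // different buffer, same spelling
    auto c = t.intern("bar");
    assert(a is b);                // same object, so identity comparison works
    assert(a !is c);
    assert(t.length == 2);         // storage for unique identifiers only
}
```

The token carries the Identifier reference, so the parser compares references 
and never rescans or rehashes the text.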