std.d.lexer requirements

Thu Aug 2 16:52:35 PDT 2012

On Thursday, August 02, 2012 19:30:47 Andrei Alexandrescu wrote:
> On 8/2/12 7:18 PM, Jonathan M Davis wrote:
> Your insights are always appreciated; even their Cliff notes :o).

LOL. Well, I'm not about to decide on the best approach to this without 
thinking through it more. What I've been doing manages to deal quite nicely 
with avoiding unnecessary decoding and still allows for the lexing of ranges 
of dchar which aren't strings (though there's obviously an efficiency hit 
there), and it really isn't complicated or messy thanks to some basic mixins 
that I've been using. Switching to operating specifically on code units and not 
accepting ranges of dchar at all has some serious ramifications, and I have to 
think through them all before I take a position on that.

> > but Walter seems to be arguing that strings
> > should be treated as ranges of code units in general, which I think is
> > completely wrong.
> 
> I think Walter has very often emphasized the need for the lexer to be
> faster than the usual client software. My perception is that he's
> discussing lexer design with the understanding that there's a need for a
> less comfortable approach, namely doing the decoding in the client.

That may be, but if he's arguing that strings should _always_ be treated as 
ranges of code units - as in all D programs, most of which don't have anything 
to do with lexers (other than when they're compiled) - then I'm definitely 
going to object to that, and it's my understanding that that's what he's 
arguing. But maybe I've misunderstood.

I've been arguing that strings should still be treated as ranges of code 
points, and that doing so does not preclude making the lexer operate 
efficiently on code units when it's given strings, even if in general it 
operates on ranges of dchar. Whether it's better to make the lexer operate on 
ranges of dchar but specialize on strings, or to make it operate specifically 
on ranges of code units, depends on what we want it to be usable with. It 
should be just as fast with strings in either case, so it becomes a question 
of how we want to handle ranges which _aren't_ strings.
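
To make the first option concrete, here is a minimal sketch of a function that 
accepts any range of dchar but specializes on narrow strings so that they are 
indexed by code unit and never decoded. The name countLeadingBlanks is 
hypothetical and purely illustrative; it is not part of any proposed 
std.d.lexer API, and this is not the mixin-based code mentioned above. It 
relies on ASCII characters always being a single code unit in UTF-8 and 
UTF-16.

```d
import std.range;
import std.traits : isNarrowString, Unqual;

// Hypothetical helper: counts the leading ASCII spaces and tabs.
// Accepts any input range whose element type is dchar, which
// includes string and wstring as far as the range primitives go.
size_t countLeadingBlanks(R)(R input)
    if (isInputRange!R && is(Unqual!(ElementType!R) == dchar))
{
    size_t n = 0;
    static if (isNarrowString!R)
    {
        // Specialization: index the code units directly. No decoding
        // is needed, since ' ' and '\t' never occur inside a
        // multi-unit sequence.
        while (n < input.length && (input[n] == ' ' || input[n] == '\t'))
            ++n;
    }
    else
    {
        // Generic path: any other range of dchar, one element at a time.
        for (; !input.empty; input.popFront())
        {
            immutable c = input.front;
            if (c != ' ' && c != '\t')
                break;
            ++n;
        }
    }
    return n;
}
```

Calling it as countLeadingBlanks(" \tfoo") takes the specialized branch, 
while countLeadingBlanks(" \tfoo"d) takes the generic one; the caller sees 
the same interface either way.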

I suppose that we could make it operate on code units and just let ranges of 
dchar have UTF-32 as their code unit (since dchar is both a code unit and a 
code point), then ranges of dchar will still work but ranges of char and wchar 
will _also_ work. Hmmm. As I said, I'll have to think this through a bit.
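
A rough sketch of what that code-unit version might look like, under a few 
assumptions: asciiIdentLength is a hypothetical name, and std.utf.byCodeUnit 
is taken from current Phobos (it was not available when this was written) as 
the way to view a string as a range of its code units. The function accepts 
ranges of char, wchar, or dchar alike and only ever compares individual code 
units.

```d
import std.range;
import std.traits : isSomeChar;
import std.utf : byCodeUnit;

// Hypothetical helper: returns the number of leading code units that
// form an ASCII identifier ([A-Za-z0-9_]). Works on any input range
// of char, wchar, or dchar; for dchar, the code unit is the code point.
size_t asciiIdentLength(R)(R units)
    if (isInputRange!R && isSomeChar!(ElementType!R))
{
    size_t n = 0;
    for (; !units.empty; units.popFront())
    {
        immutable c = units.front;
        immutable isIdentChar =
            (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
            (c >= '0' && c <= '9') || c == '_';
        if (!isIdentChar)
            break;
        ++n;
    }
    return n;
}
```

With this shape, asciiIdentLength("foo_1 = 2".byCodeUnit) walks UTF-8 code 
units, the same call on a wstring walks UTF-16 code units, and a dstring can 
be passed as-is. Passing a plain string without byCodeUnit still compiles, 
but then the range primitives decode it to dchar first, which is exactly the 
cost being discussed.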

- Jonathan M Davis