D port of dmd: Lexer, Parser, AND CodeGenerator fully operational

Thu Mar 8 10:46:33 PST 2012

On Thursday, March 08, 2012 22:03:12 Dmitry Olshansky wrote:
> On 08.03.2012 11:48, Jonathan M Davis wrote:
> > A range is not necessarily a dynamic array, though a dynamic array is a
> > range. The lexer is going to need to take a range of dchar (which may or
> > may not be an array), and it's probably going to need to return a range
> > of tokens. The parser would then take a range of tokens and then output
> > the AST in some form or other - it probably couldn't be range, but I'm
> > not sure. And while the lexer would need to operate on generic ranges of
> > dchar, it would probably have to be special-cased for strings in a number
> > of places in order to make it faster (e.g. checking the first char in a
> > string rather than using front when it's known that the value being
> > checked against is an ASCII character and will therefore fit in a single
> > char - front has to decode the next character, which is less efficient).
> 
> Simply put, the decision on decoding should belong to the lexer. Thus
> strings should be wrapped as input ranges of char, wchar & dchar
> respectively.

??? The normal way to handle this is to simply special-case certain 
operations. e.g.

static if(is(Unqual!(ElementEncodingType!R) == char))
{ ... }
else
{ ... }

I'm not sure that wrapping char and wchar arrays in structs that treat them as
ranges of char or wchar is a good idea. At minimum, I'm not aware of anything
in Phobos currently working that way (unless you did something like that in
std.regex?). Everything either treats them as generic ranges of dchar or
special-cases them. And when you want to be most efficient with string
processing, I would think that you'd want to treat them exactly as the arrays
of code units that they are rather than as ranges. In that case, treating them
as generic ranges of dchar in most places and then special-casing the sections
of code which can take advantage of the fact that they're arrays of code units
seems like the way to go. The lexer is then still choosing when something
decodes, though the default is to decode, since it requires special-casing to
avoid it.
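
As a rough sketch of the kind of special-casing I mean (startsWithAscii is a
made-up name for illustration, not anything in Phobos, and it assumes that the
character being checked for is ASCII):

```d
import std.range;  // isInputRange, ElementEncodingType, empty, front
import std.traits; // Unqual, isSomeString

/// Hypothetical helper: true if `input` begins with the ASCII character `c`.
bool startsWithAscii(R)(R input, char c)
    if (isInputRange!R)
{
    assert(c < 0x80, "c must be an ASCII character");

    static if (isSomeString!R && is(Unqual!(ElementEncodingType!R) == char))
    {
        // UTF-8 string: an ASCII character is always exactly one code unit,
        // and no code unit of a multi-byte sequence is below 0x80, so the
        // first code unit can be compared directly without decoding.
        return input.length != 0 && input[0] == c;
    }
    else
    {
        // Any other range of characters: use front, which decodes a code
        // point when the range is a narrow string such as a wstring.
        return !input.empty && input.front == c;
    }
}

unittest
{
    assert(startsWithAscii("int x;", 'i'));       // string takes the fast path
    assert(!startsWithAscii("", 'i'));            // empty input never matches
    assert(!startsWithAscii("\u00E9a", 'a'));     // first code point is not 'a'
    assert(startsWithAscii("int x;"w, 'i'));      // wstring takes the generic path
    assert(startsWithAscii("int x;"d, 'i'));      // dstring takes the generic path
}
```

Only the char branch avoids front here. Everything else still goes through the
generic range code, so decoding stays the default and skipping it is the
explicit exception.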

- Jonathan M Davis