D port of dmd: Lexer, Parser, AND CodeGenerator fully operational

Dmitry Olshansky dmitry.olsh at gmail.com
Thu Mar 8 12:54:48 PST 2012


On 08.03.2012 22:46, Jonathan M Davis wrote:
> On Thursday, March 08, 2012 22:03:12 Dmitry Olshansky wrote:
>> On 08.03.2012 11:48, Jonathan M Davis wrote:
>>> A range is not necessarily a dynamic array, though a dynamic array is a
>>> range. The lexer is going to need to take a range of dchar (which may or
>>> may not be an array), and it's probably going to need to return a range
>>> of tokens. The parser would then take a range of tokens and then output
> the AST in some form or other - it probably couldn't be a range, but I'm
>>> not sure. And while the lexer would need to operate on generic ranges of
>>> dchar, it would probably have to be special-cased for strings in a number
>>> of places in order to make it faster (e.g. checking the first char in a
>>> string rather than using front when it's known that the value being
>>> checked against is an ASCII character and will therefore fit in a single
>>> char - front has to decode the next character, which is less efficient).
>>
>> Simply put, the decision on decoding should belong to the lexer. Thus
>> strings should be wrapped as input ranges of char, wchar, and dchar
>> respectively.
>
> ??? The normal way to handle this is to simply special-case certain
> operations. e.g.
>
> static if(is(Unqual!(ElementEncodingType!R) == char))
> { ... }

Is ElementEncodingType aware of anything other than string/wstring?

> else
> { ... }
>
> I'm not sure that wrapping char and wchar arrays in structs that treat them as
> ranges of char or wchar is a good idea. At minimum, I'm not aware of anything
> in Phobos currently working that way (unless you did something like that in
> std.regex?).

Well, it's not that I'm insisting on _wrapping_ the arrays or some such 
in the general sense. And yes, I had some hard experience in std.regex 
with UTF decoding and a range design that doesn't exactly fit.

What I'm truly against is going overly generic and automatically 
stomping on your performance. That being said, the general design of 
string ranges already has an unnecessary pessimization: popFront does a 
UTF length lookup, and an identical operation is performed when 
decoding the first code point (.front).
At any rate, building a lexer on top of ranges in the current setting 
means using an abstraction that _hides_ decoding inside. That's a bad 
idea; it's losing the fight from the start. Why? Because the range 
can't know whether decoding is needed in this particular lexer state or 
not; it is generality that kills this by definition.
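
To make that concrete, here is roughly what the string-as-dchar-range 
primitives boil down to (a simplified sketch of my own, not the actual 
Phobos code):

import std.utf : decode, stride;

// Simplified sketch of a char[] viewed as a range of dchar.
// Both primitives have to work out the length of the leading UTF-8
// sequence, so a front/popFront pair does that work twice.
struct AutoDecodingView
{
    string s;

    @property bool empty() const { return s.length == 0; }

    @property dchar front()
    {
        size_t i = 0;
        return decode(s, i);       // computes the stride and builds the dchar
    }

    void popFront()
    {
        s = s[stride(s, 0) .. $];  // computes the very same stride again
    }
}

A lexer sitting on top of that pays for the decode even when all it 
wants is to compare against an ASCII punctuation character.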

Yeah, in general auto-magically decoding ranges are OK, as long as the 
work you do costs much more than the decoding (~10 times would be fine) 
or is something you strictly can't do otherwise. A lexer doesn't meet 
either criterion.

> Everything either treats them as generic ranges of dchar or
> special cases them. And when you want to be most efficient with string
> processing, I would think that you'd want to treat them exactly as the arrays
> of code units that they are rather than ranges - in which case treating them
> as generic ranges of dchar in most places and then special casing certain
> sections of code which can take advantage of the fact that they're arrays of
> code units seems like the way to go.


Yeah, no argument there. The thing is that the lexer as a whole is 
precisely one of these special cases. It's good only as long as it's 
fast, and that requires more control than "a generic input range of 
dchar".
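
For instance, once the lexer is allowed to see the code units, 
whitespace skipping can stay on an ASCII fast path and only decode when 
a multi-byte sequence actually shows up (a hypothetical sketch, not 
dmd's code; skipWhitespace is a made-up helper):

import std.uni : isWhite;
import std.utf : decode;

// Hypothetical fast path: advance past whitespace in a string without
// decoding anything that fits in a single code unit.
void skipWhitespace(ref string input)
{
    while (input.length)
    {
        immutable char c = input[0];
        if (c < 0x80)                   // ASCII: no decoding needed
        {
            if (c != ' ' && c != '\t' && c != '\r' && c != '\n')
                return;
            input = input[1 .. $];
        }
        else                            // non-ASCII: decode just this one
        {
            size_t i = 0;
            if (!isWhite(decode(input, i)))
                return;
            input = input[i .. $];
        }
    }
}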

Now, speaking outside of this specific problem.
Basically, I would propose formalizing the kind of range that the 
current string/wstring is. And that is a variable-length encoding range 
(VLE range), a two in one - a random-access code unit range and a 
bidirectional 'code point' range. I've heard of attempts at this 
concept before, but now, with a use case at hand, it could become more 
important.
The problem, I think, is that the current input range interface is 
insufficient, as it requires calculating the length of the first 
element twice: once in front and again in popFront.
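
Something along these lines, just to make the idea concrete (the name 
and the exact primitives are invented for illustration; nothing like 
this exists in Phobos today):

import std.utf : decode, stride, strideBack;

// Sketch of a VLE range over UTF-8: random access to code units,
// bidirectional iteration over code points, and popFront can reuse
// the length that front already computed while decoding.
struct VleRange
{
    string data;

    // random-access code unit view
    @property size_t unitLength() const { return data.length; }
    char unitAt(size_t i) const { return data[i]; }

    // bidirectional code point view
    @property bool empty() const { return data.length == 0; }

    @property dchar front()
    {
        frontStride = 0;
        return decode(data, frontStride);   // records the stride as it decodes
    }

    void popFront()
    {
        // reuse the cached stride instead of computing it a second time
        immutable n = frontStride ? frontStride : stride(data, 0);
        data = data[n .. $];
        frontStride = 0;
    }

    @property dchar back()
    {
        size_t i = data.length - strideBack(data, data.length);
        return decode(data, i);
    }

    void popBack()
    {
        data = data[0 .. $ - strideBack(data, data.length)];
    }

private:
    size_t frontStride;   // set by front, consumed by popFront
}

The point being that popFront gets access to what front already knows, 
and the lexer can drop down to unitAt/unitLength whenever it knows it 
is looking at plain ASCII.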



-- 
Dmitry Olshansky

