Looking for champion - std.lang.d.lex

Mon Oct 25 19:44:38 PDT 2010

"Walter Bright" <newshound2 at digitalmars.com> wrote in message 
news:ia59si$1r0j$1 at digitalmars.com...
>
> Consider a string literal, say "abc\"def". With Goldie's method, I infer 
> this string has to be scanned twice. Once to find its limits, and the 
> second to convert it to the actual string.

Yea, that is true. With that string in the input, the value given to the 
user code will be:

assert(tokenObtainedFromGoldie.toString() == q{"abc\"def"});

That's a consequence of the grammar being separated from lexing/parsing 
implementation.

You're right that that does seem less than ideal. Although I'm not sure how 
to remedy that without loosing the independence between grammar and 
lex/parse implementation that is the main point of the GOLD-based style.

But there's something I don't quite understand about the approach you're 
suggesting: You seem to be suggesting that a terminal be progressively 
converted into its final form *as* it's still in the process of being 
recognized by the DFA. Which means, you don't know *what* you're supposed to 
be converting it into *while* you're converting it. Which means, you have to 
be speculatively converting it into all types of tokens that the current DFA 
state could possibly be on its way towards accepting (also, the DFA would 
need to contain a record of possible terminals for each DFA state). And then 
the result is thrown away if it turns out to be a different terminal. Is 
this correct? If so, is there generally enough lexical difference between 
the terminals that need such treatment to compensate for the extra 
processing needed in situations that are closer to worst-case (that is, in 
comparison to Goldie's current approach)?

If all of that is so, then what would be your thoughts on this approach?:

Suppose Goldie had a way to associate an optional "simultaneous/lockstep 
conversion" to a type of terminal. For instance:

myLanguage.associateConversion("StringLiteral", new 
StringLiteralConverter());

Then, 'StringLiteralConverter' would be something that could be either 
user-provided or offered by Goldie (both ways would be supported). It would 
be some sort of class or something that had three basic functions:

class StringLiteralConverter : ITerminalConverter
{
    void process(dchar c) {...}

    // Or maybe this to make it possible to minimize allocations
    // in certain circumstances by utilizing slices:
    void process(dchar c, size_t indexIntoSource, string fullOrignalSource) 
{...}

    Variant emit() {...}
    void clear() {...}
}

Each state in the lexer's DFA would know which terminals it could possibly 
be processing. And for each of those terminals that has an associated 
converter, the lexer will call 'process()'. If a terminal is accepted, 
'emit' is called to get the final result (and maybe do any needed 
finalization first), and then 'clear' is called on all converters that had 
been used.

This feature would preclude the use of the actual "GOLD Parser Builder" 
program, but since I'm writing a tool to handle that functionality anyway, 
I'm not too concerned about that.

Do you think that would work? Would its benefits be killed by the overhead 
introduced? If so, could those overheads be sufficiently reduced without 
scrapping the general idea?

>
> What Goldie will be compared against is Spirit. Spirit is a reasonably 
> successful add-on to C++. Goldie doesn't have to do things the same way as 
> Spirit (expression templates - ugh), but it should be as easy to use and 
> at least as powerful.
>

Understood.

>
>> Personally, I'd much rather have a clutter of source files than a 
>> cluttered source file. (But of course, I don't go to Java extremes and 
>> put *every* tiny little thing in a separate file.) As long as the 
>> complexity of having multiple files isn't passed along to user code 
>> (hence the frequent "module foo.all" idiom), then I can't say I really 
>> see a problem.
>
> I tend to just not like having to constantly grep to see which file XXX is 
> in.

Diff'rent strokes, I guess. I've only ever had that problem with Tango, 
which seems to kinda follow from the Java-STD-lib school of API design (no 
offense intended, Tango guys). But if I'm working on something that involves 
different sections of a codebase, which is very frequent, then I find it to 
be quite a pain to constantly scroll all around instead of just Ctrl-Tabbing 
between open files in different tabs.