DDMD and such.

Dmitry Olshansky dmitry.olsh at gmail.com
Wed Sep 28 15:03:41 PDT 2011


On 29.09.2011 1:20, Nick Sabalausky wrote:
> "Jonathan M Davis"<jmdavisProg at gmx.com>  wrote in message
> news:mailman.271.1317243599.26225.digitalmars-d at puremagic.com...
>> On Wednesday, September 28, 2011 13:43 Nick Sabalausky wrote:
>>> "Jonathan M Davis"<jmdavisProg at gmx.com>  wrote in message
>>> news:mailman.261.1317239287.26225.digitalmars-d at puremagic.com...
>>>
>>>> I would point out that there is an intention to eventually get a D
>>>> lexer
>>>> and
>>>> parser into Phobos so that tools can take advantage of them. Those
>>>> could
>>>> eventually lead to a frontend in D but would provide benefits far
>>>> beyond
>>>> simply
>>>> having the compiler in D.
>>>
>>> Is the interest more in a D-specific lexer/parser or a generalized one?
>>> Or
>>> is it more of a split vote? I seem to remember interest both ways, but I
>>> don't know whether there's any consensus among the DMD/Phobos crew.
>>>
>>> A generalized lexer is nothing more than a regex engine that has more
>>> than
>>> one distinct accept state (which then gets run over and over until EOF).
>>> And the FSM is made simply by doing a combined regex "(regexForToken1 |
>>> regexForToken2 | regexForToken3 | ... )", and then each of those parts
>>> just get their own accept state. Which makes me wonder...
>>>
>>> There was a GSoC project to overhaul Phobos's regex engine, wasn't there?
>>> Is that done? Is it designed in a way that the stuff above wouldn't be
>>> real hard to add?
>>>
>>> And what about algoritm? Is it a Thompson NFA, ie, it traverses the NFA
>>> as
>>> if it were a DFA, effectively "creating" the DFA on-the-fly)? Or does it
>>> just traverse the NFA as an NFA? Or does it create an actual DFA and
>>> traverse that? An actual DFA would probably be best for a lexer. If a
>>> DFA,
>>> is it an optimized DFA? In my (limited) tests, it didn't seem like
>>> DFA-optimization would yield a notable benefit on typical
>>> programming-langauge tokens. It seems to be more suited to pathological
>>> cases.
>>
>> There is some desire to have a lexer and parser in Phobos which basically
>> have
>> the same implementation as dmd (only in D instead of C++). That way,
>> they're
>> very close to the actual compiler, and it's easy to port fixes and
>> improvements between the two.
>>
>> However, we definitely also want a more general lexer/parser generator
>> which
>> takes advantage of D's metaprogramming capabalities. Andrei was pushing
>> more
>> for that and doesn't really like the idea of the other, since it would
>> reduce
>> the desire to produce the more general solution. So, this _is_ some
>> dissension
>> on the matter. But there's definitely room for both. It's just a question
>> of
>> time and manpower.
>>
>
> Boy, I gotta say I'm really tempted to tackle this. I don't know if I
> *should* dedicate my already-tight time, but it's very tempting. And I have
> already written a generalized lexer generator in D (
> www.semitwist.com/goldie ), so I have that experience (and codebase) to draw
> upon.
>

Interesting and I almost forgot that we have lexer generator... What 
that "generalized" bit applies to? Does it tackle CFG? Then that would 
have been parser in my vocabulary ;)
Judging by first pages I see LALR(1) so definitely a parser.
I'm more into LL+something or PEGs. I'm liking the way e.g. ANTLR does 
this, a very nice hybrid approach.

> Only big question is whether it would be best to try to make Phobos's
> existing regex engine flexible enough that it could be used by the lexer
> (since a generalized lexer is essentially a regex engine with multiple
> accept states, and optionally some customizable hooks). I've posted some
> questions to that end in another branch of this thread.
>

To that end all what needs to be done is to restrict some wild stuff 
like backreferences & lookaround (how the hell thought that was good 
idea?!). Then use existing parser to get IR code for regex  per each 
alternative, then fuse them via thompson construction (keeping note of 
terminal states).
Taking Unicode into account I'd rather not go for table driven DFA. I'd 
better craft some switch statements and let the compiler sweat :)

-- 
Dmitry Olshansky


More information about the Digitalmars-d mailing list