DDMD and such.

Nick Sabalausky a at a.a
Wed Sep 28 15:22:54 PDT 2011


"Dmitry Olshansky" <dmitry.olsh at gmail.com> wrote in message 
news:j605ks$hut$1 at digitalmars.com...
> On 29.09.2011 1:20, Nick Sabalausky wrote:
>>
>> Boy, I gotta say I'm really tempted to tackle this. I don't know if I
>> *should* dedicate my already-tight time, but it's very tempting. And I
>> have already written a generalized lexer generator in D
>> ( www.semitwist.com/goldie ), so I have that experience (and codebase)
>> to draw upon.
>>
>
> Interesting, and I almost forgot that we have a lexer generator... What
> does that "generalized" bit apply to? Does it tackle CFGs? Then that
> would have been a parser in my vocabulary ;)
> Judging by the first pages I see LALR(1), so definitely a parser.
> I'm more into LL+something or PEGs. I like the way e.g. ANTLR does
> this, a very nice hybrid approach.
>

It's both. It can do lexing and parsing, or just one or the other by
itself.

The parsing is LALR(1); the lexing is a compiled DFA built from regular
expressions (the syntax isn't traditional PCRE, but it's basically a
regex).
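
To make the "compiled DFA" idea a bit more concrete, here's a rough sketch
in Python. This is not Goldie's actual code: the states, the token names
(`Ident`, `Int`), and the helper functions are all made up for
illustration, and the DFA is written by hand instead of being compiled
from regexes. It just shows one DFA with more than one accept state,
scanned with the usual longest-match rule.

```python
# Hypothetical hand-built DFA; a real generator would compile these
# tables from regular expressions instead of writing them by hand.
IDENT, INT = "Ident", "Int"

# state -> list of (predicate on one character, next state)
TRANSITIONS = {
    0: [(lambda c: c.isalpha() or c == "_", 1), (str.isdigit, 2)],
    1: [(lambda c: c.isalnum() or c == "_", 1)],
    2: [(str.isdigit, 2)],
}

# Several accept states, each tagged with the token kind it recognizes.
ACCEPT = {1: IDENT, 2: INT}


def step(state, ch):
    """Return the next state for ch, or None if there is no transition."""
    for pred, target in TRANSITIONS.get(state, []):
        if pred(ch):
            return target
    return None


def lex(text):
    """Split text into (kind, lexeme) pairs using longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():  # whitespace is skipped, not tokenized
            pos += 1
            continue
        state, i = 0, pos
        last_accept = None  # (end position, token kind) of the best match
        while i < len(text):
            state = step(state, text[i])
            if state is None:
                break
            i += 1
            if state in ACCEPT:
                last_accept = (i, ACCEPT[state])
        if last_accept is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        end, kind = last_accept
        tokens.append((kind, text[pos:end]))
        pos = end
    return tokens
```

Calling `lex("foo 42 bar_7")` yields three tokens: an `Ident`, an `Int`,
and another `Ident`. Which accept state the DFA was last in decides the
token kind.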

>> Only big question is whether it would be best to try to make Phobos's
>> existing regex engine flexible enough that it could be used by the lexer
>> (since a generalized lexer is essentially a regex engine with multiple
>> accept states, and optionally some customizable hooks). I've posted some
>> questions to that end in another branch of this thread.
>>
>
> To that end, all that needs to be done is to restrict some wild stuff
> like backreferences & lookaround (who the hell thought that was a good
> idea?!). Then use the existing parser to get IR code for the regex of
> each alternative, then fuse them via Thompson construction (keeping
> note of terminal states).
> Taking Unicode into account, I'd rather not go for a table-driven DFA.
> I'd rather craft some switch statements and let the compiler sweat :)
>

I'm using a span-based table of code units instead of simply an array of
code units. It seems to work fine on Unicode (much better than a list of
code units). But yeah, you could probably do better by generating switches
or something to be mixed in. I was thinking of doing that, but haven't
gotten around to it.
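
For anyone wondering what a span-based table might look like, here's a
minimal sketch in Python. Again, this is an illustration and not Goldie's
real data structure: the `SpanTable` class, its `lookup` method, and the
example ranges are all hypothetical. Note also that Python strings index
by code point, so this sketch works on code points instead of code units.
The idea is that each transition is stored as an inclusive range plus a
target state, and lookup is a binary search over the range starts.

```python
from bisect import bisect_right


class SpanTable:
    """Maps characters to target states via sorted, non-overlapping spans."""

    def __init__(self, spans):
        # spans: iterable of (first, last, target) with inclusive bounds,
        # assumed not to overlap.
        self.spans = sorted(spans)
        self.starts = [first for first, _, _ in self.spans]

    def lookup(self, ch):
        """Return the target state for ch, or None if no span covers it."""
        cp = ord(ch)
        # Index of the last span whose start is <= cp.
        i = bisect_right(self.starts, cp) - 1
        if i >= 0:
            _, last, target = self.spans[i]
            if cp <= last:
                return target
        return None


# Example: digits go to state 2; ASCII letters and the Cyrillic block
# go to state 1. Four spans cover several hundred code points.
table = SpanTable([
    (ord("0"), ord("9"), 2),
    (ord("a"), ord("z"), 1),
    (ord("A"), ord("Z"), 1),
    (0x0400, 0x04FF, 1),
])
```

The table size grows with the number of ranges, not with the size of the
character set, which is why it holds up on Unicode where a flat array
indexed by character would not.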





