Let's stop parser Hell

Dmitry Olshansky dmitry.olsh at gmail.com
Tue Jul 31 14:27:30 PDT 2012


On 31-Jul-12 20:10, Kai Meyer wrote:
> On 07/05/2012 06:11 AM, Denis Shelomovskij wrote:
>> There are more and more projects requiring parsing D code (IDE plugins,
>> DCT and dfmt now).
>>
>> We definitely need a _standard_ fast D parser (suitable as tokenizer).
>> We already have a good one at least in Visual D. Maybe dmd's parser is
>> faster. If so, it can be (easily?) rewritten in D. We also already have
>> some other parsers (in C# from Mono-D etc.).
>>
>> What about to get one and call it standard?
>>
>
> I know I'm a little late coming into this conversation. This seems like
> a nice thread to toss myself into. I've started working on a generic
> lexer that is based off of a defined grammar.
>
> As I read through the thread (I unfortunately don't have enough time to
> read every post, but I skimmed through as many as I could, and read the
> ones that seemed important), it seems like we need a parser in D that
> can lex D, and provide a Range of tokens that could be consumed.
>
> With some very minor tweaks, and a properly written Grammar class, I
> basically have it already done. D was going to be one of the first
> languages I would have written a definition for.
>
> https://github.com/kai4785/firstfront
>
> I haven't had time to look through Pegged, but there are some
> differences in my approach that I can immediately see in Pegged's.
>
> 1) Pegged does compile-time or run-time code generation for your parser.
> Mine is run-time only and regex based.
> 2) Pegged uses PEG format, which I have been introduced to through this
> thread. One of my goals was to actually invent something like PEG. This
> will save me time :)
>

> I would love to receive some critical feedback on my approach as well as
> the actual code base.

Okay. Critics go as follows:

First, as an author of std.regex I'm pleased that you find it suitable 
to use it in lexer.  :)

Still there is a BIG word of warning:

Lexer based on trying out an array of regexes until one will match is 
NOT fast and not even close to fast ones. It is acceptable as proof of 
concept/prototype kind of thing only.

To help this use case I though of making multi-regex matching a-la:
match(text, regex1, regex2, regex3...);
then  lexing would be effectively a loop of form:

foreach(tok; match(text, regex1, regex2, regex3...)){
	switch(tok[0]){//suppose match returns tuple - N of regex matched + the 
usual piece of text
		...
	}
}
(with some glitches on /+ ... +/ and other non-regular stuff).

But until such time it's not to be used seriously in this context.

The reason is that each call to match does have some overhead to setup 
regex engine, it's constant but it's huge compared to what usual lexers 
typically do. The other thing is that you should use ctRegex for this 
kind of job if you can (i.e. compiler doesn't explode on it).


(keep in mind I only glimpsed the source, will post more feedback later)

>
> To build it, you'll need cmake, and cmaked2 from here:
> http://code.google.com/p/cmaked2/wiki/GettingStarted
>
> Or just dmd *.d :) I haven't tried to build it on Windows yet, but I
> don't see anything that immediately jumps out as not cross-platform.
> I've been working on it on both Fedora and CentOS.
>

rdmd for the win!


-- 
Dmitry Olshansky


More information about the Digitalmars-d mailing list