Let's stop parser Hell

Kai Meyer kai at unixlords.com
Mon Aug 6 10:05:55 PDT 2012


On 07/31/2012 03:27 PM, Dmitry Olshansky wrote:
> On 31-Jul-12 20:10, Kai Meyer wrote:
>> On 07/05/2012 06:11 AM, Denis Shelomovskij wrote:
>>> There are more and more projects requiring parsing D code (IDE plugins,
>>> DCT and dfmt now).
>>>
>>> We definitely need a _standard_ fast D parser (suitable as tokenizer).
>>> We already have a good one at least in Visual D. Maybe dmd's parser is
>>> faster. If so, it can be (easily?) rewritten in D. We also already have
>>> some other parsers (in C# from Mono-D etc.).
>>>
>>> What about getting one and calling it standard?
>>>
>>
>> I know I'm a little late coming into this conversation. This seems like
>> a nice thread to toss myself into. I've started working on a generic
>> lexer that is based on a defined grammar.
>>
>> As I read through the thread (I unfortunately don't have enough time to
>> read every post, but I skimmed through as many as I could, and read the
>> ones that seemed important), it seems like we need a parser in D that
>> can lex D, and provide a Range of tokens that could be consumed.
>>
>> With some very minor tweaks, and a properly written Grammar class, I
>> basically have it already done. D was going to be one of the first
>> languages I would have written a definition for.
>>
>> https://github.com/kai4785/firstfront
>>
>> I haven't had time to look through Pegged, but there are some
>> differences in my approach that I can immediately see in Pegged's.
>>
>> 1) Pegged does compile-time or run-time code generation for your parser.
>> Mine is run-time only and regex based.
>> 2) Pegged uses PEG format, which I have been introduced to through this
>> thread. One of my goals was to actually invent something like PEG. This
>> will save me time :)
>>
>
>> I would love to receive some critical feedback on my approach as well as
>> the actual code base.
>
> Okay. Critique goes as follows:
>
> First, as the author of std.regex, I'm pleased that you find it
> suitable for use in a lexer. :)
>
> Still there is a BIG word of warning:
>
> A lexer based on trying out an array of regexes until one matches is
> NOT fast, and not even close to the fast ones. It is acceptable only as
> a proof-of-concept/prototype kind of thing.
>
> To help this use case I thought of making multi-regex matching a la:
> match(text, regex1, regex2, regex3...);
> then lexing would effectively be a loop of the form:
>
> foreach(tok; match(text, regex1, regex2, regex3...)){
>     switch(tok[0]){ // suppose match returns a tuple: N of the regex
>                     // matched + the usual piece of text
>     ...
>     }
> }
> (with some glitches on /+ ... +/ and other non-regular stuff).
>
> But until such time it's not to be used seriously in this context.
>
> The reason is that each call to match has some overhead to set up the
> regex engine; it's constant, but it's huge compared to what usual
> lexers typically do. The other thing is that you should use ctRegex for
> this kind of job if you can (i.e. if the compiler doesn't explode on
> it).
>
>
> (keep in mind I only glanced at the source; I'll post more feedback later)
>
>>
>> To build it, you'll need cmake, and cmaked2 from here:
>> http://code.google.com/p/cmaked2/wiki/GettingStarted
>>
>> Or just dmd *.d :) I haven't tried to build it on Windows yet, but I
>> don't see anything that immediately jumps out as not cross-platform.
>> I've been working on it on both Fedora and CentOS.
>>
>
> rdmd for the win!
>
>

I cut my teeth on Perl, so everything gets compared to Perl in my mind.

In Perl, I can 'precompile' a regular expression to avoid some of that 
overhead. Without precompiling, something like this:

while (<STDIN>) {
    if ($_ =~ m/<someregex>/) {
        # some work
    }
}

would end up re-compiling the regex on each line from STDIN. Perl uses 
the term "compiling" the regular expression, which may be different from 
what you mean by "setup regex engine".

Does/Can D's std.regex offer something like this? If not, I would be 
interested in why not.
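
Concretely, the kind of loop I'm hoping to write in D would look 
something like this (a rough, untested sketch; I'm assuming regex() 
gives me an engine object I can construct once and reuse, and that the 
ctRegex you mention moves that construction to compile time):

import std.regex;
import std.stdio;

void main()
{
    // Build the engine once, up front -- the rough analogue of
    // precompiling the pattern in Perl.
    auto re = regex(`<someregex>`);

    // Or, if the pattern is known at compile time, presumably:
    // enum re = ctRegex!(`<someregex>`);

    foreach (line; stdin.byLine())
    {
        if (!match(line, re).empty)
        {
            // some work
        }
    }
}

If the per-call setup cost you describe still applies even when the 
Regex object is built once like this, that's the part I'd like to 
understand better.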

-Kai Meyer
