[GSoC’11] Lexing and parsing

Tue Mar 22 21:39:09 PDT 2011

On Tue, 22 Mar 2011 18:27:51 -0400, Ilya Pupatenko <pupatenko at gmail.com>  
wrote:

> Hi,
>
> First of all, I want to be polite so I have to introduce myself (you can  
> skip this paragraph if you feel tired of newcomer-students’ posts). My  
> name is Ilya, I’m a Master student of IT department of Novosibirsk State  
> University (Novosibirsk, Russia). In Soviet period Novosibirsk became on  
> of the most important science center in the country and now there are  
> very close relations between University and Academy of Science. That’s  
> why it’s difficult and very interesting to study here. But I’m not  
> planning to study or work this summer, so I’ll be able to work (nearly)  
> full time on GSoC project. My primary specialization is seismic  
> tomography inverse problems, but I’m also interested in programming  
> language implementation and compilation theory. I have good knowledge of  
> C++ and C# languages and “intermediate” knowledge of D language,  
> knowledge of compilation theory, some experience in implementing lexers,  
> parsers and translators, basic knowledge of lex/yacc/antlr and some  
> knowledge of Boost.Spirit library. I’m not an expert in D now, but I  
> willing to learn and to solve difficult tasks, that’s why I decided to  
> apply on the GSoC.
>
> I’m still working on my proposal (on task “Lexing and Parsing”), but I  
> want to write some general ideas and ask some questions.
>
> 1. It is said that “it is possible to write a highly-integrated  
> lexer/perser generator in D without resorting to additional tools”. As I  
> understand, the library should allow programmer to write grammar  
> directly in D (ideally, the syntax should be somehow similar to EBNF)  
> and the resulting parser will be generated by D compiler while compiling  
> the program. This method allows integration of parsing in D code; it can  
> make code simpler and even sometimes more efficient.
> There is a library for C++ (named Boost.Spirit) that follows the same  
> idea. It provide (probably not ideal but very nice) “EBNF-like” syntax  
> to write a grammar, it’s quite powerful, fast and flexible. There are  
> three parts in this library (actually there are 4 parts but we’re not  
> interested in Spirit.Classic now):
> • Spirit.Qi (parser library that allows to build recursive descent  
> parsers);
> • Spirit.Karma (generator library);
> • Spirit.Lex (library usable to create tokenizers).
> The Spirit library uses “C++ template black magic” heavily (for example,  
> via Boost.Fusion). But D has greater metaprogramming abilities, so it is  
> possible to implement the same functionality in easier and “clean” way.
> So, the question is: is it a good idea if at least parser library  
> architecture will be somewhat similar to Spirit one? Of course it is not  
> about “blind” copying; but creating architecture for such a big system  
> completely from scratch is quite difficult indeed. If to be exact, I  
> like an idea of parser attributes, I like the way semantic actions are  
> described, and the “auto-rules” seems really useful.

I'm not qualified to speak on Spirits internal architecture; I've only  
used it once for something very simple and ran into a one-liner bug which  
remains unfixed 7+ years later. But the basic API of Spirit would be wrong  
for D. “it is possible to write a highly-integrated lexer/perser generator  
in D without resorting to additional tools” does not mean "the library  
should allow programmer to write grammar directly in D (ideally, the  
syntax should be somehow similar to EBNF)" it means that the library  
should allow you to write a grammar in EBNF and then through a combination  
of templates, string mixins and compile-time function evaluation generate  
the appropriate (hopefully optimal) parser. D's compile-time programming  
abilities are strong enough to do the code generation job usually left to  
separate tools. Ultimately a user of the library should be able to declare  
a parser something like this:

// Declare a parser for Wikipedia's EBNF sample language
Parser!`
(* a simple program syntax in EBNF − Wikipedia *)
program = 'PROGRAM' , white space , identifier , white space ,
            'BEGIN' , white space ,
            { assignment , ";" , white space } ,
            'END.' ;
identifier = alphabetic character , { alphabetic character | digit } ;
number = [ "-" ] , digit , { digit } ;
string = '"' , { all characters − '"' } , '"' ;
assignment = identifier , ":=" , ( number | identifier | string ) ;
alphabetic character = "A" | "B" | "C" | "D" | "E" | "F" | "G"
                      | "H" | "I" | "J" | "K" | "L" | "M" | "N"
                      | "O" | "P" | "Q" | "R" | "S" | "T" | "U"
                      | "V" | "W" | "X" | "Y" | "Z" ;
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
white space = ? white space characters ? ;
all characters = ? all visible characters ? ;
` wikiLangParser;