[GSoC’11] Lexing and parsing

Robert Jacques sandford at jhu.edu
Wed Mar 23 07:45:21 PDT 2011


On Wed, 23 Mar 2011 08:22:10 -0400, spir <denis.spir at gmail.com> wrote:
> On 03/23/2011 05:39 AM, Robert Jacques wrote:
>> On Tue, 22 Mar 2011 18:27:51 -0400, Ilya Pupatenko  
>> <pupatenko at gmail.com> wrote:
>>
>>> Hi,
>>>
>>> First of all, I want to be polite so I have to introduce myself (you  
>>> can skip
>>> this paragraph if you feel tired of newcomer-students’ posts). My name  
>>> is
>>> Ilya, I’m a Master student of IT department of Novosibirsk State  
>>> University
>>> (Novosibirsk, Russia). In Soviet period Novosibirsk became on of the  
>>> most
>>> important science center in the country and now there are very close
>>> relations between University and Academy of Science. That’s why it’s
>>> difficult and very interesting to study here. But I’m not planning to  
>>> study
>>> or work this summer, so I’ll be able to work (nearly) full time on GSoC
>>> project. My primary specialization is seismic tomography inverse  
>>> problems,
>>> but I’m also interested in programming language implementation and
>>> compilation theory. I have good knowledge of C++ and C# languages and
>>> “intermediate” knowledge of D language, knowledge of compilation  
>>> theory, some
>>> experience in implementing lexers, parsers and translators, basic  
>>> knowledge
>>> of lex/yacc/antlr and some knowledge of Boost.Spirit library. I’m not  
>>> an
>>> expert in D now, but I willing to learn and to solve difficult tasks,  
>>> that’s
>>> why I decided to apply on the GSoC.
>>>
>>> I’m still working on my proposal (on task “Lexing and Parsing”), but I  
>>> want
>>> to write some general ideas and ask some questions.
>>>
>>> 1. It is said that “it is possible to write a highly-integrated  
>>> lexer/perser
>>> generator in D without resorting to additional tools”. As I  
>>> understand, the
>>> library should allow programmer to write grammar directly in D  
>>> (ideally, the
>>> syntax should be somehow similar to EBNF) and the resulting parser  
>>> will be
>>> generated by D compiler while compiling the program. This method allows
>>> integration of parsing in D code; it can make code simpler and even  
>>> sometimes
>>> more efficient.
>>> There is a library for C++ (named Boost.Spirit) that follows the same  
>>> idea.
>>> It provide (probably not ideal but very nice) “EBNF-like” syntax to  
>>> write a
>>> grammar, it’s quite powerful, fast and flexible. There are three parts  
>>> in
>>> this library (actually there are 4 parts but we’re not interested in
>>> Spirit.Classic now):
>>> • Spirit.Qi (parser library that allows to build recursive descent  
>>> parsers);
>>> • Spirit.Karma (generator library);
>>> • Spirit.Lex (library usable to create tokenizers).
>>> The Spirit library uses “C++ template black magic” heavily (for  
>>> example, via
>>> Boost.Fusion). But D has greater metaprogramming abilities, so it is  
>>> possible
>>> to implement the same functionality in easier and “clean” way.
>>> So, the question is: is it a good idea if at least parser library
>>> architecture will be somewhat similar to Spirit one? Of course it is  
>>> not
>>> about “blind” copying; but creating architecture for such a big system
>>> completely from scratch is quite difficult indeed. If to be exact, I  
>>> like an
>>> idea of parser attributes, I like the way semantic actions are  
>>> described, and
>>> the “auto-rules” seems really useful.
>>
>> I'm not qualified to speak on Spirits internal architecture; I've only  
>> used it
>> once for something very simple and ran into a one-liner bug which  
>> remains
>> unfixed 7+ years later. But the basic API of Spirit would be wrong for  
>> D. “it
>> is possible to write a highly-integrated lexer/perser generator in D  
>> without
>> resorting to additional tools” does not mean "the library should allow
>> programmer to write grammar directly in D (ideally, the syntax should be
>> somehow similar to EBNF)" it means that the library should allow you to  
>> write a
>> grammar in EBNF and then through a combination of templates, string  
>> mixins and
>> compile-time function evaluation generate the appropriate (hopefully  
>> optimal)
>> parser. D's compile-time programming abilities are strong enough to do  
>> the code
>> generation job usually left to separate tools. Ultimately a user of the  
>> library
>> should be able to declare a parser something like this:
>>
>> // Declare a parser for Wikipedia's EBNF sample language
>> Parser!`
>> (* a simple program syntax in EBNF − Wikipedia *)
>> program = 'PROGRAM' , white space , identifier , white space ,
>> 'BEGIN' , white space ,
>> { assignment , ";" , white space } ,
>> 'END.' ;
>> identifier = alphabetic character , { alphabetic character | digit } ;
>> number = [ "-" ] , digit , { digit } ;
>> string = '"' , { all characters − '"' } , '"' ;
>> assignment = identifier , ":=" , ( number | identifier | string ) ;
>> alphabetic character = "A" | "B" | "C" | "D" | "E" | "F" | "G"
>> | "H" | "I" | "J" | "K" | "L" | "M" | "N"
>> | "O" | "P" | "Q" | "R" | "S" | "T" | "U"
>> | "V" | "W" | "X" | "Y" | "Z" ;
>> digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
>> white space = ? white space characters ? ;
>> all characters = ? all visible characters ? ;
>> ` wikiLangParser;
>
> How does one solve the issues of self/mutual/circular pattern recursion  
> at compile-time?
>
> Denis

How do you solve it at runtime? Then apply CTFE. Alternatively, how would  
you solve it with a functional language, then apply templates. I think you  
missed a point though; Parser generates a parser given a EBNF grammar. And  
therefore would internally behave like any other DSL -> code generation  
tool (except it would be a library).

P.S. self/mutual/circular pattern recursion occurs all the time in  
templates and CTFE.


More information about the Digitalmars-d mailing list