[GSOC Draft Proposal] ANTLR and Java based D parser for IDE usage

Wed Mar 30 21:20:39 PDT 2011

Hello,
I will move from California to Italy tomorrow so I will not be able to
read new posts.
As soon as I arrive home I'll read and respond to any reviews.

Thank you

Luca Boasso

On 3/29/11, Luca Boasso <luke.boasso at gmail.com> wrote:
> Hello,
>
> thank you very much for your useful comments.
>
> I have updated the proposal version in the www.google-melange.com.
> I post here the changes and the updated version.
>
> Changes:
> - There are different IDEs for the D programming language. The purpose of
> this
> project proposal is to write a parser for a superset of the D programming
> language (including v1 and v2) that is tailored for IDEs needs.
>
> -Specifically, a good error recovery strategy is an essential feature in an
> IDE.
> The parser should be able to restore itself to a state where processing of
> the
> input can continue with reasonable hopes that further processing will
> provide
> meaningful diagnostic information.
>
> -Once I have got an overall understanding I will write the production rules
> of
> a superset of the D grammar in the ANTLR grammar notation (similar to
> EBNF).
>
>
> PROPOSAL
>
> ANTLR and Java based D parser for IDE usage
> ===========================================
>
> Luca Boasso
>
>
>
> Abstract
> --------
>
> The project aims to implement an ANTLR parser for the D programming language
> and
> the consequent integration with the DDT Eclipse-based IDE.
> The parser will be designed to be reused in other IDEs or tools.
>
>
> Rationale
> ---------
>
> There are different IDEs for the D programming language. The purpose of
> this
> project proposal is to write a parser for a superset of the D programming
> language (including v1 and v2) that is tailored for IDEs needs.
> The new parser will be designed to be modular and abstracted from any
> particular IDE implementation detail, so that it can be used in different
> IDEs
> or with tools that need an abstract syntax tree of the source code for
> further
> analysis.
>
> Particular care will be taken to integrate the new parser with the DDT
> Eclipse-based IDE so that this project will be useful in the short-term.
> The DDT project needs a new parser up-to-date with the latest D syntax,
> with
> a better error recovery and improved performance [0].
> Specifically, a good error recovery strategy is an essential feature in an
> IDE.
> The parser should be able to restore itself to a state where processing of
> the
> input can continue with reasonable hopes that further processing will
> provide
> meaningful diagnostic information.
>
> Thanks to this integration it will be possible to understand the
> appropriate
> interface for the parser so that in the long-term the same code could be
> used
> in different projects.
>
> I will use the ANTLR parser generator for this project. This parser
> generator
> has been proven to be a valuable tool with advanced features for tree
> construction and tree manipulations that cuts development time [1]. The
> LL(*)
> parsing algorithm on which ANTLR is based upon allows efficient parsing of
> complex grammar with good error handling and unrestricted grammar actions
> [2].
>
> The use of a parser generator allows the creations of parsers in different
> programming languages.
> This project will focus on the creation of a Java parser. Since ANTLR
> support
> many target languages [3], it will not be so difficult to  generate a parser
> in
> the original implementation language of the IDE. Eg. Generate a C++ parser
> for
> the D language that will be used in the IDE  written in C++.
>
> Furthermore, updates of the D grammar are reflected in a more convenient
> way
> through modifications of the ANTLR grammar of D, than through a modification
> of
> a hand-written parser.
> In particular, one of the problems faced by DDT developers was to keep
> their
> parser up-to-date with the reference one (DMD parser) [4].
> It is time-consuming and error-prone to manually port and mantain the DMD
> parser
> written in C++ to another language, instead most of the modification will
> be
> handled by ANTLR.
>
> In addition, easy modification of the D language syntax encourages
> experimentation for the benefit of the language's evolution.
>
> Finally in the process of writing a new parser eventual misunderstanding or
> inconsistency of the D language reference and documentations will be
> addressed.
>
> A good set of test will be created to guarantee the compatibility of the
> new
> parser with the official language definition and the DMD parser.
>
>
> Timeline
> --------
>
> This is a tentative timeline to be further discussed with the help of the
> community.
> I am committed to dedicate substantially to this project knowing that I
> also
> have to pass some exams.
> I estimate that I could spend initially approximately 30h/week.
> After the exam session I will work full-time on this project.
>
> - April 25 – May 23: Community Bonding Period
>   Since I am new in the D community I will spend some time learning how to
>   contribute while following the guideline of the community and the
> DDT project.
>   I will start reviewing the language reference asking for clarifications
>   when needed.
>   Once I have got an overall understanding I will write the production rules
> of
>   a superset of the D grammar in the ANTLR grammar notation (similar to
> EBNF).
>
> - May 23 – July 11: Developing phase I
>   The correctness of the parser is of paramount importance.
>   I will create many tests to exercise the parser (at this point just a
>   "recognizer") obtained as output from ANTLR.
>   Once I am confident with the parser conforms to the language reference
> and
>   recognizes the same language as the parser in DMD, I will enhance it with
> AST
>   construction rules.
>   At this point, I need to discuss with the DDT team the type of AST that
> has to
>   be built for IDEs purposes, and confirm which annotations are most useful
>   (eg. source ranges).
>
> - July 11 – August 15: Developing phase II
>   In this phase I will create unit tests to verify the correctness of the
>   generated trees and I will focus on the integration of the parser with the
> DDT
>   project.
>   In the remaining time I will provide good error recovery to the parser and
> I
>   will improve the overall performance.
>
> - August 15 - August 22: Final phase
>   I will use this last week to polish the code and improve the
> documentation.
>   As a final task, I will think about how support for incremental parsing
> can be
>   added in the future.
>
>
> About me
> --------
>
> I am Luca Boasso, I am a computer engineering student of Polytechnic
> University
> of Turin [5].
> I won a scholarship for a double-degree program and obtained the first
> master
> in France under Telecom ParisTech [6] in 2010.
> After having an internship in Panasonic R&D in Cupertino [7], I am
> currently
> finishing my last semester in Italy to receive my second master.
> I have been always had a passion for programming language design and
> implementations.
> I study and use in my spare time several programming languages.
> I have become a recent supporter of the D programming language and I am
> amazed
> by its expressiveness, power and net improvement over C++.
> In my master thesis, I designed a domain specific language for SAP lab
> France
> and I wrote several parsers by hand and also with the help of ANTLR.
>
>
> Contact Information
> -------------------
>
> Skype: luca_378
> Email: luke.boasso at gmail.com
>
>
> References
> ----------
>
> [0]   http://code.google.com/a/eclipselabs.org/p/ddt/wiki/DevelopmentGuide
> [1]   http://www.antlr.org/why.html
> [2]   http://www.antlr.org/papers/LL-star-PLDI11.pdf
> [3]   http://www.antlr.org/wiki/display/ANTLR3/Code+Generation+Targets
> [4]
> http://www.digitalmars.com/d/archives/digitalmars/D/ide/Future_of_Descent_
>       and_D_Eclipse_IDE_635.html
> [5]
> https://secure.wikimedia.org/wikipedia/en/wiki/Polytechnic_University_of
>       _Turin
> [6]   https://secure.wikimedia.org/wikipedia/en/wiki/TELECOM_ParisTech
> [7]   http://www.research.panasonic.com/
>
>
>
>
> On 3/29/11, Bruno Medeiros <brunodomedeiros+spam at com.gmail> wrote:
>>
>> On 28/03/2011 01:52, Luca Boasso wrote:
>>> Sorry for my late draft proposal, I'm currently moving so I didn't
>>> have enough time this days.
>>> I would be glad to have your opinion.
>>>
>>> Thank you
>>>
>>> <DRAFT PROPOSAL>
>>>
>>> Rationale
>>> ---------
>>>
>>> There are different IDEs for the D programming language. The purpose of
>>> this
>>> project proposal is to write a parser for the D programming language (v1
>>> and v2)
>>> that is tailored for IDEs needs. The new parser will be designed to be
>>> modular
>>> and abstracted from any particular IDE implementation detail, so that it
>>> can be
>>> used in different IDEs or with tools that need an abstract syntax tree
>>> of
>>> the
>>> source code for further analysis.
>>> Particular care will be taken to integrate the new parser with the DDT
>>> Eclipse-based IDE so that this project will be useful in the short-term.
>>  >
>>> The DDT project needs a new parser up-to-date with the latest D syntax,
>>> with
>>> a better error recovery and improved performance.
>>> Thanks to this integration it will be possible to understand the
>>> appropriate
>>> interface for the parser so that in the long-term the same code could be
>>> used in
>>> different projects.
>>>
>>> I will use the ANTLR parser generator for this project. This parser
>>> generator
>>> has been proven to be a valuable tool with advanced features for tree
>>> construction and  tree manipulations that cuts development time [1]. The
>>> LL(*)
>>> parsing algorithm  on which ANTLR is based upon allows efficient parsing
>>> of
>>> complex grammar with good error handling and unrestricted grammar
>>> actions
>>> [2].
>>>
>>> The use of a parser generator allows the creations of parsers in
>>> different
>>> programming languages. This project will focus on the creation of a Java
>>> parser.
>>> Since ANTLR support many target languages [3], it will not be so
>>> difficult
>>> to
>>> generate a parser in the original implementation language of the IDE.
>>> Eg. Generate a C++ parser for the D language that will be used in the
>>> IDE
>>> written in C++.
>>>
>>> Furthermore, updates of the D grammar are reflected in a more convenient
>>> way
>>> through modifications of the ANTLR grammar of D, than through a
>>> modification of
>>> a hand-written parser.
>>> In particular, one of the problems faced by DDT developers was to keep
>>> their
>>> parser up-to-date with the reference one (DMD parser) [4].
>>> It is time-consuming and error-prone to manually port the DMD parser
>>> written in
>>> C++ to another language, instead most of the modification will be
>>> handled
>>> by
>>> ANTLR.
>>>
>>> In addition, easy modification of the D language syntax encourages
>>> experimentation for the benefit of the language's evolution.
>>>
>>>
>>> Finally in the process of writing a new parser eventual misunderstanding
>>> or
>>> inconsistency of the D language reference and documentations will be
>>> addressed.
>>> A good set of test will be created to guarantee the compatibility of the
>>> new
>>> parser with the official language definition and the DMD parser.
>>>
>>
>>
>> Like Andrei said, and as is already mentioned in this proposal, I think
>> the focus of this parser project should to integrate with DDT, so that
>> we can have something directly useful at the conclusion of the project.
>> And also to validate that the parser is worthwhile for IDE usage.
>> Fortunately this is not contrary to the other goals of making the
>> grammar reusable for other ANTLR-based parsers coded in another
>> language, or to make the D parser reusable in other Java-based projects.
>> The DDT AST classes (and the basic semantic engine) are already isolated
>> in their own bundle/module, conceptually independent of any Eclipse code
>> (there a few minor coded dependencies that are trivially removable).
>>
>>
>>
>> The proposal text looks good to me, but one missing thing that I think
>> is key to consider is error recovery. The current parser (Descent/DMD)
>> is already fairly good at this, (although it could be improved in some
>> regards). The new ANTLR parser would not need to be as good as DMD, but
>> it should have good recovery at least in same basic IDE usage cases. So
>> for example:
>>
>> /* block structure stuff: */
>> void func() {
>>    blah(
>> }
>> // the parser should still recover successfully and parse the rest of //
>> the file after func
>>
>> Recovery inside statements, and some other use cases are also very
>> important, but this can be discussed in more detail later, my point now
>> is just that the consideration of the syntax recovery should be present
>> to the proposal. (just mention it, no need to write much about it)
>>
>>
>> Some other comments relating to implementation and design details:
>>
>>> Once I have got an overall understanding I will write the production
>>>    rules of the D grammar (v1 and v2) in the ANTLR grammar notation
>>> (similar to
>>>    EBNF).
>>>
>>
>> Hum, I am inclined to think that having two separate grammars for each
>> version of D is not the best approach. For starters, even for D2 there
>> are not one, but many version of the language, even with regards to just
>> parsing D2. True, we may choose to not support those previous versions,
>> and focus only on D2 as of TDPL, but it is still important to be mindful
>> of this. Also because there might be additions to the syntax of D2.
>> And here IDE development differs somewhat from a compiler. In a compiler
>> you would just change the parser code to the latest version of the
>> language. And so the latest compiler only supports the latest version of
>> the language. However, in the IDE you ideally want the latest version of
>> the IDE to support *all* previous versions of the language. Or at least
>> all versions that users might still want to code in.
>> So it is better perhaps to have just one grammar that is a superset of
>> D1 and D2, (and then afterward have some "syntax" validator on the
>> AST/tokens to make sure it is valid to a given language version)
>>
>>
>>
>> On 28/03/2011 01:52, Luca Boasso wrote:
>>  >    At this point, I need to discuss with the DDT team the type of AST
>> that has to
>>  >    be built for IDEs purposes, and confirm which annotations are most
>> useful
>>  >    (eg. source ranges).
>>
>> As for the AST that should be generated, you can already see how it
>> should (mostly) be, by looking here:
>> http://code.google.com/a/eclipselabs.org/p/ddt/source/browse/#hg%2Forg.dsource.ddt.dtool%2Fsrc%2Fdtool%2Fast
>> That AST is generally what the parser should generate, although minor
>> adjustments and changes might be necessary or desirable, yes.
>> There are also some parser tests there, but they are very few and
>> limited.
>>
>>
>> I also have some comments for the timeline but I'll leave that for
>> another post.
>>
>> --
>> Bruno Medeiros - Software Engineer
>>
>