[GSOC Draft Proposal] ANTLR and Java based D parser for IDE usage

Tue Mar 29 08:04:53 PDT 2011

On 28/03/2011 01:52, Luca Boasso wrote:
> Sorry for my late draft proposal, I'm currently moving so I didn't
> have enough time these days.
> I would be glad to have your opinion.
>
> Thank you
>
> <DRAFT PROPOSAL>
>
> Rationale
> ---------
>
> There are different IDEs for the D programming language. The purpose of this
> project proposal is to write a parser for the D programming language (v1 and v2)
> that is tailored for IDE needs. The new parser will be designed to be modular
> and abstracted from any particular IDE implementation detail, so that it can be
> used in different IDEs or with tools that need an abstract syntax tree of the
> source code for further analysis.
> Particular care will be taken to integrate the new parser with the DDT
> Eclipse-based IDE so that this project will be useful in the short-term.
>
> The DDT project needs a new parser up-to-date with the latest D syntax, with
> better error recovery and improved performance.
> Thanks to this integration it will be possible to understand the appropriate
> interface for the parser so that in the long-term the same code could be used in
> different projects.
>
> I will use the ANTLR parser generator for this project. This parser generator
> has been proven to be a valuable tool with advanced features for tree
> construction and tree manipulation that cut development time [1]. The LL(*)
> parsing algorithm on which ANTLR is based allows efficient parsing of
> complex grammars with good error handling and unrestricted grammar actions [2].
>
> The use of a parser generator allows the creation of parsers in different
> programming languages. This project will focus on the creation of a Java parser.
> Since ANTLR supports many target languages [3], it will not be so difficult to
> generate a parser in the original implementation language of the IDE,
> e.g. generate a C++ parser for the D language that will be used in an IDE
> written in C++.
>
> Furthermore, updates of the D grammar are reflected in a more convenient way
> through modifications of the ANTLR grammar of D, than through a modification of
> a hand-written parser.
> In particular, one of the problems faced by DDT developers was to keep their
> parser up-to-date with the reference one (DMD parser) [4].
> It is time-consuming and error-prone to manually port the DMD parser written in
> C++ to another language; instead, most of the modifications will be handled by
> ANTLR.
>
> In addition, easy modification of the D language syntax encourages
> experimentation for the benefit of the language's evolution.
>
>
> Finally, in the process of writing a new parser, any misunderstandings or
> inconsistencies in the D language reference and documentation will be addressed.
> A good set of tests will be created to guarantee the compatibility of the new
> parser with the official language definition and the DMD parser.
>

Like Andrei said, and as is already mentioned in this proposal, I think
the focus of this parser project should be to integrate with DDT, so that
we can have something directly useful at the conclusion of the project,
and also to validate that the parser is worthwhile for IDE usage.
Fortunately this is not contrary to the other goals of making the
grammar reusable for other ANTLR-based parsers coded in another
language, or of making the D parser reusable in other Java-based projects.
The DDT AST classes (and the basic semantic engine) are already isolated
in their own bundle/module, conceptually independent of any Eclipse code
(there are a few minor code dependencies that are trivially removable).

The proposal text looks good to me, but one missing thing that I think
is key to consider is error recovery. The current parser (Descent/DMD)
is already fairly good at this (although it could be improved in some
regards). The new ANTLR parser would not need to be as good as DMD, but
it should have good recovery at least in some basic IDE usage cases. So
for example:

/* block structure stuff: */
void func() {
   blah(
}
// the parser should still recover successfully and parse the rest of
// the file after func

Recovery inside statements, and some other use cases, are also very
important, but this can be discussed in more detail later. My point now
is just that syntax recovery should be considered in the proposal (just
mention it, no need to write much about it).
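
To make the block-structure case concrete, here is a minimal sketch in
Java of "panic mode" recovery that uses the closing brace of a function
as the synchronization point. This is a hand-written toy, not ANTLR's
own recovery mechanism and not code from DDT, Descent or DMD; the class
name RecoverySketch, the crude lexer and the tiny "void name() { call(); }"
language are all assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not DDT, Descent, DMD, or ANTLR code.
class RecoverySketch {
    static class ParseError extends RuntimeException {
        ParseError(String message) { super(message); }
    }

    final List<String> functions = new ArrayList<>(); // names of functions seen
    final List<String> errors = new ArrayList<>();    // collected diagnostics
    private final List<String> toks = new ArrayList<>();
    private int pos = 0;

    RecoverySketch(String src) {
        // Crude lexer: an identifier, or any single non-space character.
        Matcher m = Pattern.compile("[A-Za-z_]\\w*|\\S").matcher(src);
        while (m.find()) toks.add(m.group());
    }

    // file := function*
    void parseFile() {
        while (pos < toks.size()) parseFunction();
    }

    // function := "void" ident "(" ")" "{" call* "}"
    private void parseFunction() {
        boolean inBody = false;
        try {
            expect("void");
            functions.add(identifier());
            expect("("); expect(")"); expect("{");
            inBody = true;
            while (!peek().equals("}")) parseCall();
            expect("}");
        } catch (ParseError e) {
            // Record the problem, then resynchronize at the end of the block
            // so that the declarations after it are still parsed.
            errors.add(e.getMessage());
            syncToBlockEnd(inBody ? 1 : 0);
        }
    }

    // call := ident "(" ")" ";"
    private void parseCall() {
        identifier();
        expect("("); expect(")"); expect(";");
    }

    // Skips tokens up to and including the "}" that closes the current block.
    private void syncToBlockEnd(int depth) {
        while (pos < toks.size()) {
            String t = toks.get(pos++);
            if (t.equals("{")) depth++;
            else if (t.equals("}") && --depth <= 0) return;
        }
    }

    private String peek() { return pos < toks.size() ? toks.get(pos) : "<EOF>"; }

    private void expect(String wanted) {
        if (!peek().equals(wanted))
            throw new ParseError("expected '" + wanted + "' but found '" + peek() + "'");
        pos++;
    }

    private String identifier() {
        String t = peek();
        if (!t.matches("[A-Za-z_]\\w*"))
            throw new ParseError("expected identifier but found '" + t + "'");
        pos++;
        return t;
    }
}
```

Fed the func example above followed by a second, well-formed function,
this sketch records one error for the unclosed call and still lists both
functions. A real parser would need finer-grained recovery (inside
statements and expressions), and a stray token at file level makes this
toy skip more than it should.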

Some other comments relating to implementation and design details:

> Once I have got an overall understanding I will write the production
>    rules of the D grammar (v1 and v2) in the ANTLR grammar notation (similar to
>    EBNF).
>

Hum, I am inclined to think that having two separate grammars, one for each
version of D, is not the best approach. For starters, even for D2 there
is not one, but many versions of the language, even with regards to just
parsing D2. True, we may choose not to support those previous versions,
and focus only on D2 as of TDPL, but it is still important to be mindful
of this, also because there might be additions to the syntax of D2.
And here IDE development differs somewhat from a compiler. In a compiler 
you would just change the parser code to the latest version of the 
language. And so the latest compiler only supports the latest version of 
the language. However, in the IDE you ideally want the latest version of 
the IDE to support *all* previous versions of the language. Or at least 
all versions that users might still want to code in.
So it is perhaps better to have just one grammar that is a superset of
D1 and D2 (and then afterward have some "syntax" validator on the
AST/tokens to make sure it is valid for a given language version).
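
A rough sketch of what such a post-parse validator could look like, in
Java. It assumes the superset parser has already accepted the input and
simply checks the token stream against a target language version. The
names VersionValidatorSketch, DVersion and INTRODUCED_IN are
hypothetical, and the keyword table is a small illustrative sample (to
my knowledge these three are D2-only keywords), not a complete or
authoritative list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of DDT or of any ANTLR API.
class VersionValidatorSketch {
    enum DVersion { D1, D2 } // declaration order doubles as "older < newer"

    // Illustrative sample: keywords mapped to the version that introduced them.
    static final Map<String, DVersion> INTRODUCED_IN = Map.of(
            "immutable", DVersion.D2,
            "pure", DVersion.D2,
            "nothrow", DVersion.D2);

    // Returns one message per token that is newer than the target version.
    static List<String> validate(List<String> tokens, DVersion target) {
        List<String> problems = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            DVersion since = INTRODUCED_IN.get(tokens.get(i));
            if (since != null && since.compareTo(target) > 0) {
                problems.add("token " + i + ": '" + tokens.get(i)
                        + "' requires " + since);
            }
        }
        return problems;
    }
}
```

The point of the design is that the grammar stays single and permissive,
while version differences live in a small table or rule set that the IDE
can select per project. A real validator would more likely walk the AST
than raw tokens, so that it can flag whole constructs and attach source
ranges to the messages.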

On 28/03/2011 01:52, Luca Boasso wrote:
>    At this point, I need to discuss with the DDT team the type of AST that has to
>    be built for IDEs purposes, and confirm which annotations are most useful
>    (eg. source ranges).

As for the AST that should be generated, you can already see how it 
should (mostly) be, by looking here:
http://code.google.com/a/eclipselabs.org/p/ddt/source/browse/#hg%2Forg.dsource.ddt.dtool%2Fsrc%2Fdtool%2Fast
That AST is generally what the parser should generate, although minor 
adjustments and changes might be necessary or desirable, yes.
There are also some parser tests there, but they are very few and limited.
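
On the source ranges mentioned in the quoted text: the sketch below, in
Java, shows why an IDE wants them on every node, namely to map a cursor
offset back to the innermost AST node. The class NodeSketch and its
fields are made up for this example and are not the actual DDT AST
classes linked above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; the real DDT AST classes differ.
class NodeSketch {
    final String kind; // e.g. "Module", "Function", "Call"
    final int start;   // offset of the first character of the node
    final int end;     // offset just past the last character (exclusive)
    final List<NodeSketch> children = new ArrayList<>();

    NodeSketch(String kind, int start, int end) {
        this.kind = kind;
        this.start = start;
        this.end = end;
    }

    NodeSketch add(NodeSketch child) {
        children.add(child);
        return this;
    }

    // Returns the innermost node whose range contains offset, or null.
    NodeSketch findAt(int offset) {
        if (offset < start || offset >= end) return null;
        for (NodeSketch child : children) {
            NodeSketch hit = child.findAt(offset);
            if (hit != null) return hit;
        }
        return this;
    }
}
```

Features such as "go to definition", hover and code completion all start
from a lookup like findAt, which is why the ranges need to stay accurate
even for nodes produced after error recovery.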

I also have some comments for the timeline but I'll leave that for 
another post.

-- 
Bruno Medeiros - Software Engineer