[SAOC 2023] dfmt rewrite - Weekly update #1

Sun Sep 24 21:11:34 UTC 2023

On Friday, 22 September 2023 at 07:12:02 UTC, Prajwal S N wrote:
> Hi everyone,
>
> For SAOC 2023, I'm working on refactoring 
> [dfmt](https://github.com/dlang-community/dfmt) to use the AST 
> from DMD-as-a-library instead of libdparse.
>
> The past week has been very interesting. I got up to speed with 
> the dfmt codebase, and managed to do a 1-to-1 port of the lexer 
> dependency from libdparse to DMD-as-a-library. Most parts were 
> pretty straightforward, and the bulk of the work was replacing 
> every `tok!"<token>"` instance with `TOK.<token>` and making 
> sure the token coming from DMD was the same as what was 
> previously being used. So far so good!
>
> You can see the draft PR tracking the work 
> [here](https://github.com/dlang-community/dfmt/pull/589).
>
> Going forward, my mentor and I have decided that it would be 
> impractical to try and replace the parser directly, for 
> multiple reasons:
>
> - It's a lot of work to replace the parser and use the DMD AST 
> instead of libdparse's, and all of this work will happen 
> without a working version of dfmt. If, at the end of this, dfmt 
> is broken or refuses to compile, it could very well mean that 
> all that effort went down the drain.
> - Doing a brute force replacement of the parser will prevent us 
> from testing the transformation passes in dfmt individually, 
> and also brings us back to the point above.
>
> Hence, we've decided to do an incremental rewrite of the files 
> that use the parser, initially with no passes (just to ensure 
> the AST is being built in the first place), and then adding 
> each pass along with relevant unit tests.

I don't want to sound alarmist or anything, but an AST is not 
really what you want to work with as a formatter.

The main reason is that you want to carry around a lot of 
information that the AST generally doesn't care about (comments, 
layout information, etc.). Consider the following example:

```d
int a; // this is an int.
int b;
```

We immediately recognize that the comment refers to `a`. However:

```d
int a;

// this is an int.
int b;
```

Now we recognize that the comment refers to `b`.

There are a lot of subtle semantics in there that are very hard to 
convey through an AST and very hard to work with in that form.
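
To make the two cases above concrete, here is a minimal sketch of the kind of layout-dependent bookkeeping a formatter needs. It is not how dfmt or sdfmt work; `Stmt` and `attachComments` are hypothetical names made up for this post, and the scan is deliberately naive (it assumes one statement per line and ignores `//` inside string literals as well as block comments). The only point is that the attachment is decided by where the comment sits on the line, which is exactly the information an AST tends to drop.

```d
import std.algorithm.searching : findSplit;
import std.string : lineSplitter, strip;

/// Hypothetical record: one statement plus the comments that must travel with it.
struct Stmt
{
    string code;
    string[] leading; // comments on their own lines before the statement
    string trailing;  // comment sharing the statement's line
}

/// Naive line-based scan (an assumption for illustration, not a real lexer).
Stmt[] attachComments(string source)
{
    Stmt[] result;
    string[] pending; // own-line comments waiting for the next statement

    foreach (line; source.lineSplitter)
    {
        auto parts = line.findSplit("//");
        string code = parts[0].strip;
        // parts[1] is the separator and is empty when no "//" was found.
        string comment = parts[1].length ? parts[2].strip : null;

        if (code.length)
        {
            // Code and comment share a line: the comment trails this statement.
            result ~= Stmt(code, pending, comment);
            pending = null;
        }
        else if (comment.length)
        {
            // Comment alone on its line: it leads the next statement.
            pending ~= comment;
        }
    }
    return result;
}

unittest
{
    auto first = attachComments("int a; // this is an int.\nint b;\n");
    assert(first[0].trailing == "this is an int."); // belongs to a
    assert(first[1].leading.length == 0);

    auto second = attachComments("int a;\n\n// this is an int.\nint b;\n");
    assert(second[0].trailing.length == 0);
    assert(second[1].leading == ["this is an int."]); // belongs to b
}
```

Both inputs contain the same declarations and the same comment text; only the layout differs, and that alone changes which declaration the comment is attached to.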

There is a lot of prior art on the matter of code formatting, and 
the best explanation is probably the one from dartfmt's author: 
https://journal.stuffwithstuff.com/2015/09/08/the-hardest-program-ive-ever-written/ . clang-format and many others do use this approach.

Shameless plug: sdfmt uses that approach. You can get it here: 
https://code.dlang.org/packages/sdc%3Asdfmt .

I understand it is probably out of scope to turn things around 
at this point, but holy hell, do we really need, as a community, 
to redo all the mistakes other communities have made instead of 
learning from them, and, to add insult to injury, involve junior 
devs in that madness?