DCT: D compiler as a collection of libraries
Roman D. Boiko
rb at d-coding.com
Sun May 20 01:09:34 PDT 2012
On Saturday, 19 May 2012 at 20:35:10 UTC, Marco Leise wrote:
> Am Fri, 11 May 2012 10:01:28 +0200
> schrieb "Roman D. Boiko" <rb at d-coding.com>:
>
>> There were several discussions about the need for a D compiler
>> library.
>>
>> I propose my draft implementation of lexer for community
>> review:
>> https://github.com/roman-d-boiko/dct
(I decided to comment on both my post and your reply.)
I've got *a lot* of feedback from the community, which caused a
significant redesign (and also a delay of yet unknown duration).
I'll write more on that later. Thanks a lot to everybody!
>> Lexer is based on Brian Schott's project
>> https://github.com/Hackerpilot/Dscanner, but it has been
>> refactored and extended (and more changes are on the way).
Looks like I'm going to replace this implementation almost
completely.
>> The goal is to have source code loading, lexer, parser and
>> semantic analysis available as parts of Phobos. These
>> libraries should be designed to be usable in multiple
>> scenarios (e.g., refactoring, code analysis, etc.).
See below.
>> My commitment is to have at least front end built this year
>> (and conforming to the D2 specification unless explicitly
>> stated otherwise for some particular aspect).
By the D specification I mean
http://dlang.org/language-reference.html, or (when the
specification is not clear to me) one of the following:
* TDPL
* the current DMD implementation
* community feedback.
(Note that I may assume that I understand some aspect, but later
revisit it if needed.)
>> Please post any feedback here. A dedicated project web-site
>> will be created later.
It is going to be http://d-coding.com, but it is not usable yet
(I don't have much web-design practice). It is hosted on
https://github.com/roman-d-boiko/roman-d-boiko.github.com, pull
requests are welcome.
> A general purpose D front-end library has been discussed
> several times. So at least for some day-dreaming about better
> IDE support and code analysis tools it has a huge value. It is
> good to see that someone finally takes the time to implement
> one. My only concern is that not enough planning went into the
> design.
Could you name a few specific concerns? The reason for this
thread was to reflect on the design early. OTOH, I didn't spend
time documenting my design goals and tradeoffs, so the discussion
turned into a brainstorm and wasn't always productive. But now
that I see how much value I can get even without doing my
homework, I'm much more likely to provide better documentation
and ask more specific questions.
> I think about things like brainstorming and collecting
> possible use cases from the community or looking at how some
> C++ tools do their job and what infrastructure they are built
> on. Although it is really difficult to tell from other people's
> code which decision was 'important' and what was just the
> author's way to do it.
I'm going to pick up several use cases and prioritize them
according to my judgement. Feel free to suggest any cases that
you think are needed (with motivation). Prioritizing is necessary
to define what is out of scope and plan work into milestones, in
order to ensure the project is feasible.
> Inclusion into Phobos I would not see as a priority. As
> Jonathan said, there are already some clear visions of how such
> modules would look like.
Well, *considering* such inclusion seriously would help improve
the quality of the project. But it is not necessarily what will
happen. Currently I think that my goals are close to those of use
case 2 from Jonathan's reply. But until the project is reasonably
complete it is not the time to argue whether to include it (or
its part).
> Also if any project seeks to replace the DMD front-end, Walter
> should be the technical advisor. Like anyone putting a lot of
> time and effort into a design, he could have strong feelings
> about certain decisions and implementing them in a seemingly
> worse way.
Not a goal currently, because that would make the project
significantly less likely to be completed ever.
> That said, you make the impression of being really dedicated to
> the project, even giving yourself a time line, which is a good
> sign. I wish you all the best for the project and hope that -
> even without any 'official' recognition - it will see a lot of
> use as the base of D tools.
Well, I hope that some more people will join the project once it
stabilizes and delivers something useful.
By the way, is anybody *potentially* interested to join?
> To learn about parsing I wrote a syntax highlighter for the
> DCPU-16 assembly of Minecraft author Markus Persson. (Which was
> a good exercise.) Interestingly I ended up with a similar
> design for the Token struct like yours:
> - separate array for line # lookup
> - TokenType/TokenKind enum
> - Trie for matching token kinds (why do you expand it into
> nested switch-case through CTFE mixins though?)
Switch code generation is something temporary that works for now
and helps me evaluate possible problems with the future design,
which is planned to be a compile-time finite automaton (likely
deterministic).
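The expansion into switch-case through CTFE mixins mentioned in
the question could look roughly like this minimal sketch (this is
not DCT's actual code; `genCases`, `isKeyword` and the keyword
list are invented for illustration). A function evaluated at
compile time builds the case clauses as a string, which is then
mixed into a switch:

```d
// Minimal sketch of generating switch cases at compile time via a
// string mixin. genCases runs under CTFE; the resulting string of
// case clauses is mixed into the switch below.
enum keywords = ["if", "in", "int", "import"];

string genCases(string[] kws)
{
    string code;
    foreach (kw; kws)
        code ~= `case "` ~ kw ~ `": return true;` ~ "\n";
    return code;
}

bool isKeyword(string s)
{
    switch (s)
    {
        mixin(genCases(keywords));
        default:
            return false;
    }
}

void main()
{
    assert(isKeyword("import"));
    assert(!isKeyword("foo"));
}
```

A trie would first group keywords by common prefixes and emit
nested switches over single characters; the flat string switch
above just shows the mixin mechanism.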
> Since assembly code is usually small I just preallocate an
> array of sourceCode.length tokens and realloc it to the correct
> size when I'm done parsing. Nothing pretty, but simple and I am
> sure it won't get any faster ;).
I'm sure it will :) (I'm going to elaborate on this some time
later).
> One notable difference is that I only check for isEoF in one
> location, since I append "\0" to the source code as a stop
> token (that can also serve as an end-of-line indicator). (The
> general idea is to move checks outside of loops.)
There are several EoF conditions: \0, \x1A, __EOF__ and physical
EoF. And any loop would need to check for all of them.
Preallocation could eliminate the need to check for physical EoF,
but it would make it impossible to avoid copying the input string
and would also prevent passing a function a slice of a string to
be lexed.
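A hedged sketch of what checking those four conditions might look
like (`isEoF` and `eofToken` are invented names, and the __EOF__
handling is simplified):

```d
// Sketch of the end-of-source conditions described above. A real
// lexer would also have to ensure __EOF__ is a complete token
// (not merely a prefix of a longer identifier).
bool isEoF(string source, size_t pos)
{
    if (pos >= source.length)
        return true;                              // physical EoF
    immutable c = source[pos];
    if (c == '\0' || c == '\x1A')
        return true;                              // NUL or Ctrl-Z
    enum eofToken = "__EOF__";
    return source.length - pos >= eofToken.length
        && source[pos .. pos + eofToken.length] == eofToken;
}

void main()
{
    assert(isEoF("abc", 3));            // physical end
    assert(isEoF("a\0b", 1));           // NUL
    assert(isEoF("a\x1Ab", 1));         // Ctrl-Z
    assert(isEoF("x __EOF__ y", 2));    // __EOF__ token
    assert(!isEoF("abc", 0));
}
```

Appending a sentinel such as "\0" collapses the physical-EoF case
into the character check, which is what the quoted approach
exploits, at the cost of copying the input.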
> ** Errors **
> I generally keep the start and end column, in case someone
> wants that. A real world example:
>
> ubyte x = ...;
> if (x >= 0 && x <= 8) ...
> Line 2, warning: Comparison is always true.
>
> At first glance you say "No, a byte can become larger than 8.",
> but the compiler is just complaining about the first part of
> the expression here, which would be clarified by e.g. some
> ASCII/Unicode art:
>
> Line 2, warning: Comparison is always true:
> if (x >= 0 && x <= 8) ...
> ^----^
This functionality has been a priority from the beginning. It is
not implemented yet, but the design accounts for it. Computing
column and line only on demand follows from the assumption that
such information is needed rarely (primarily to display messages
to the user). My new design also addresses the use cases where it
is needed frequently.
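The separate line-lookup array Marco mentions, combined with
on-demand computation, can be sketched as follows (`LineTable`
and `locate` are hypothetical names, not part of DCT): store only
the byte offset of each line start, then binary-search it when a
diagnostic actually needs a line and column.

```d
import std.range : assumeSorted;
import std.typecons : Tuple, tuple;

// Sketch: keep only the byte offset of each line start; compute
// line and column on demand by binary search instead of storing
// them in every token.
struct LineTable
{
    size_t[] lineStarts; // sorted; lineStarts[0] == 0

    // 1-based (line, column) of a byte offset
    Tuple!(size_t, "line", size_t, "column") locate(size_t offset)
    {
        // count of line starts at or before offset
        auto line = assumeSorted(lineStarts).lowerBound(offset + 1).length;
        return tuple!("line", "column")(line,
                                        offset - lineStarts[line - 1] + 1);
    }
}

void main()
{
    // "ab\ncd\nef": lines start at offsets 0, 3 and 6
    auto table = LineTable([0, 3, 6]);
    assert(table.locate(0) == tuple!("line", "column")(1, 1)); // 'a'
    assert(table.locate(4) == tuple!("line", "column")(2, 2)); // 'd'
    assert(table.locate(6) == tuple!("line", "column")(3, 1)); // 'e'
}
```

Tokens then only need to carry a byte offset; a span like the
`^----^` marker above falls out of two such lookups.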
> ** Code highlighting **
> Especially Gtk's TextView (which GtkSourceView is based on)
> isn't made for several thousand tokens. The text view will take
> a second per 20000 tokens or so. The source view works around
> that by highlighting in chunks, but you can still feel the
> delay when you jump to the end of the file in gedit and even
> MonoDevelop suffers from the bad performance. If I understood
> your posts correctly, you are already planning for minimal
> change sets. It is *very* important to only update changed
> parts in a syntax highlighting editor. So for now I ended up
> writing a 'insertText' and 'deleteText' method for the parser
> that take 'start index', 'text' and 'start index', 'end index'
> respectively and return a list of removed and added tokens.
> If possible, make sure that your library can be used with
> GtkSourceView, Scintilla and QSyntaxHighlighter. Unfortunately
> the bindings for the last two could use an update, but the API
> help should get you an idea of how to use it most efficiently.
Currently I plan to support GtkSourceView, Scintilla, and a
custom editor API which I plan to define so that it is most
efficient in this respect. I haven't worked on that thoroughly
yet. Incremental changes are the key to efficiency, and I'm going
to invest a lot of effort into making them work well.
Immutability of the data structures will also enable many
optimizations.
More information about the Digitalmars-d-announce
mailing list