DCT: D compiler as a collection of libraries

Roman D. Boiko rb at d-coding.com
Sun May 20 11:37:07 PDT 2012


On Sunday, 20 May 2012 at 17:42:59 UTC, Marco Leise wrote:
> There is one feature I remember caused some head-aches for 
> Code::Blocks. They used a separate parser instance for every 
> project in the IDE, which meant that all the standard include 
> files would be parsed and kept in memory multiple times. When 
> they later switched to a common parser for all projects they 
> ran into stability issues. If you can arrange it, it would be 
> great for multi-project IDEs to be able to add and remove 
> projects to your parser without reparsing Phobos/druntime 
> (which may have its .di files replaced by .d files, looking at 
> the past discussion).
The opposite situation: I hadn't thought about a parser per project :)
So I guess there's no need to worry here.

> C bindings could be an option. (As in: the smallest common 
> denominator.) They allow existing tools (written in Java, C#, 
> Python, ...) to use your library.
Yeah, but I'm far from there yet.

>> > Since assembly code is usually small I just preallocate an 
>> > array of sourceCode.length tokens and realloc it to the 
>> > correct size when I'm done parsing. Nothing pretty, but 
>> > simple and I am sure it won't get any faster ;).
>> I'm sure it will :) (I'm going to elaborate on this some time 
>> later).
>
> I'm curious.
Maybe I don't understand your use case, but the idea is that if
you parse as you type, it should be possible to avoid parsing and
allocating memory for those lines which have not changed. But
that is not compatible with storing tokens in an array, since it
would require reallocating memory each time, so some other data
structure should be used (e.g., a linked list or, if efficient
lookup is needed, a red-black tree). Only benchmarks can show
whether (and by how much) my approach would be faster for a
specific situation (input patterns such as average size of data,
complexity of parsing algorithms, requirements, etc.).
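The caching idea above could be sketched roughly like this (all names
are hypothetical, and a trivial whitespace "lexer" stands in for a
real D tokenizer):

```d
import std.algorithm : filter, map, splitter;
import std.array : array;
import std.string : lineSplitter;

struct Token { string text; }

// Trivial stand-in lexer: one token per whitespace-separated word.
Token[] lexLine(string line)
{
    return line.splitter(' ')
               .filter!(w => w.length > 0)
               .map!(w => Token(w))
               .array;
}

struct LineCache
{
    string[] lines;    // last seen source, split into lines
    Token[][] tokens;  // cached tokens, one slice per line

    void update(string source)
    {
        auto newLines = source.lineSplitter.array;
        auto newTokens = new Token[][](newLines.length);
        foreach (i, line; newLines)
        {
            // Reuse cached tokens when the line is unchanged;
            // only changed lines are re-lexed and re-allocated.
            if (i < lines.length && lines[i] == line)
                newTokens[i] = tokens[i];
            else
                newTokens[i] = lexLine(line);
        }
        lines = newLines;
        tokens = newTokens;
    }
}
```

A real implementation would match lines by diffing rather than by
index (inserting a line shifts everything below it), and tokens
spanning multiple lines (block comments, string literals) need extra
care.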

> I found that I usually either load from file into a newly 
> allocated buffer (where a copy occurs, only because I forgot 
> about assumeUnique in std.exception) or I am editing the file 
> in which case I recreate the source string after every key 
> stroke anyway. I can still pass slices of that string to 
> functions though. Not sure what you mean.
Answered below.

> It probably doesn't work for D as well as for ASM code, but I 
> could also check for \x1A and __EOF__ in the same fashion.
> (By the way, is it \x1A (substitute, ^Z) or did you mean \0x04 
> (end-of-transmission, ^D)?)
D has the following EoF cases: \0, \x1A, physical EoF, and the
__EOF__ special token when not inside a comment or certain string
literals. ^D is not in this list.
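A minimal sketch of that end-of-source check (the __EOF__ special
token is handled at the token level, so it is not shown here):

```d
// Returns true when lexing should stop at position pos:
// past the physical end of input, or at a NUL or SUB (^Z) byte.
bool isEndOfSource(string source, size_t pos)
{
    if (pos >= source.length)        // physical end of file
        return true;
    immutable c = source[pos];
    return c == '\0' || c == '\x1A'; // NUL or SUB (^Z); ^D is not EoF in D
}
```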

> The way it works is: Parser states like 'in expression' can 
> safely peek at the next character at any time. If it doesn't 
> match what they expect they emit an error and drop back to the 
> "surrounding" parser state. When they reach the "file" level, 
> that's the only place where an EOF (which will only occur once 
> per file anyway) will be consumed.
> In theory one can drop all string length checks and work on 
> char* directly with a known terminator char that is distinct 
> from anything else.
If you want to pass a slice of the input string to a function, you
cannot append \0 to it without copying data. If you don't append
some pre-defined character, you must check for length *and* all
supported terminating characters. On the other hand, your design
might not require passing slices, and if the language syntax allows
deterministic parsing (when you know what to expect next), there
is no need to check for EoF.
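The trade-off can be illustrated with two versions of the same scan:
the slice version needs a bounds check on every step, while the
sentinel version relies on a known terminator after the buffer (and
therefore on owning or copying the data). This is only an illustrative
sketch, not code from either project:

```d
// Slice version: works on any substring, but checks length each step.
size_t skipWhitespaceSlice(string s, size_t pos)
{
    while (pos < s.length && (s[pos] == ' ' || s[pos] == '\t'))
        ++pos;
    return pos;
}

// Sentinel version: assumes a '\0' terminator after the buffer, so a
// single character comparison suffices. D string literals happen to
// be zero-terminated, which makes this safe for them.
immutable(char)* skipWhitespaceSentinel(immutable(char)* p)
{
    while (*p == ' ' || *p == '\t')
        ++p;
    return p;
}
```

The sentinel version cannot accept an arbitrary slice of a larger
string, which is exactly the copy-vs-check trade-off described above.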


More information about the Digitalmars-d-announce mailing list