Building C++ modules

H. S. Teoh hsteoh at quickfur.ath.cx
Tue Aug 13 15:21:56 UTC 2019


On Tue, Aug 13, 2019 at 11:19:16AM +0200, Jacob Carlborg via Digitalmars-d wrote:
> On 2019-08-12 21:58, H. S. Teoh wrote:
> 
> > This is a big part of why C++'s
> > must-be-parsed-before-it-can-be-lexed syntax is a big hindrance to
> > meaningful progress.  The only way such a needlessly over-complex
> > syntax can be handled is a needlessly over-complex lexer/parser
> > combo, which necessarily results in needlessly over-complex corner
> > cases and other such gotchas.  Part of this nastiness is the poor
> > choice of template syntax (overloading '<' and '>' to be delimiters
> > in addition to their original roles of comparison operators), among
> > several other things.
> 
> I don't know how this is implemented in a C++ compiler but can't the
> lexer use a more abstract token that includes both the usage for
> templates and for comparison operators? The parser can then figure out
> exactly what it is.

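It's not so simple.  The problem is that in C++, the *structure* of the
parse tree changes depending on previous declarations. I.e., the grammar
is not context-free: the parser cannot decide how to group a sequence of
tokens without first knowing what the names in it were declared as.  For
example, given this C++ code: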

	int main() {
		A a;
		B b;

		// What do these lines do?
		fun<A, B>(a, b);
		gun<T, U>(a, b);
	}

What do you think the parse tree should be?

On the surface, it would appear that main() contains two variable
declarations, followed by calling two template functions with (a, b) as
the arguments.

Unfortunately, this is not true. The way the last two lines of main()
are parsed can be *wildly divergent* depending on what declarations came
before.  To see how this can be so, here's the full code (which I posted
a while back in a discussion on template syntax):

-----------------------------------snip------------------------------------
// Totally evil example of why C++ template syntax and free-for-all operator
// overloading are a Bad, Bad Idea.
#include <iostream>

struct Bad { };

struct B { };

struct A {
	Bad operator,(B b) { return Bad(); }
};

struct D { };

struct Ugly {
	D operator>(Bad b) { return D(); }
} U;

struct Terrible { } T;

struct Evil {
	~Evil() {
		std::cout << "Hard drive reformatted." << std::endl;
	}
};

struct Nasty {
	Evil operator,(D d) { return Evil(); }
};

struct Idea {
	void operator()(A a, B b) {
		std::cout << "Good idea, data saved." << std::endl;
	}
	Nasty operator<(Terrible t) { return Nasty(); }
} gun;

template<typename T, typename U>
void fun(A a, B b) {
	std::cout << "Have fun!" << std::endl;
}

int main() {
	A a;
	B b;

	// What do these lines do?
	fun<A, B>(a, b);
	gun<T, U>(a, b);
}
-----------------------------------snip------------------------------------


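Note that `gun` is not a template, and not even a function. It's a
global struct instance with a completely-abusive series of operator
overloads.  So `gun<T, U>(a, b)` is not a template call at all: it parses
as the comma expression `(gun < T), (U > (a, b))`, which chains through
the overloaded `<`, `,` and `>` operators and ends in a temporary `Evil`
whose destructor prints "Hard drive reformatted."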
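
To make the two parses easier to compare, here is a minimal sketch that
replaces the printing with a trace log.  The `trace` vector and the
`run_original`, `run_parenthesized`, `called` and `demo` helpers are
hypothetical names added for illustration; the operator overloads mirror
the ones in the example above.  It records which overloads fire for
`gun<T, U>(a, b)` and for an explicitly parenthesized comma expression.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical trace log, added for illustration only.
std::vector<std::string> trace;

struct Bad { };
struct B { };
struct A {
	Bad operator,(B) { trace.push_back("A , B -> Bad"); return Bad(); }
};
struct D { };
struct Ugly {
	D operator>(Bad) { trace.push_back("U > Bad -> D"); return D(); }
} U;
struct Terrible { } T;
struct Evil { };
struct Nasty {
	Evil operator,(D) { trace.push_back("Nasty , D -> Evil"); return Evil(); }
};
struct Idea {
	void operator()(A, B) { trace.push_back("gun(a, b)"); }
	Nasty operator<(Terrible) { trace.push_back("gun < T -> Nasty"); return Nasty(); }
} gun;

// The line exactly as written in the original example.
std::vector<std::string> run_original() {
	trace.clear();
	A a;
	B b;
	gun<T, U>(a, b);
	return trace;
}

// The same tokens with the grouping spelled out.
std::vector<std::string> run_parenthesized() {
	trace.clear();
	A a;
	B b;
	((gun < T), (U > (a, b)));
	return trace;
}

// True if the given overload was recorded in the trace.
bool called(const std::vector<std::string>& t, const std::string& what) {
	for (const std::string& entry : t)
		if (entry == what) return true;
	return false;
}

void demo() {
	std::vector<std::string> original = run_original();
	std::vector<std::string> grouped = run_parenthesized();
	// Four operator overloads fire in both forms ...
	assert(original.size() == 4 && grouped.size() == 4);
	// ... and Idea::operator() is never among them.
	assert(!called(original, "gun(a, b)"));
	assert(original.back() == "Nasty , D -> Evil");
}
```

Calling `demo()` from a `main` function should pass silently.  Only the
final entry is checked by position, because before C++17 the relative
order in which the two operands of an overloaded comma are evaluated is
unspecified.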

While I admit that this example is contrived, it does prove my point
that you simply cannot parse C++ code in any straightforward way.  You
have to use nasty hacks in both the lexer and the parser just to get the
thing to parse at all.  And this is not even touching the more pertinent
topic of C++ semantic analysis, which in many places is even worse
(SFINAE and Koenig lookup, a.k.a. argument-dependent lookup, anyone?).
Thanks to the latter, the meaning of your code can change simply by
adding an #include line at the top of the file without touching anything
else. Symbol hijacking galore!
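
As a rough illustration of the Koenig lookup point, the sketch below
shows the same unqualified call resolving to different functions
depending on which declarations are visible in the argument's namespace.
All names here (`app`, `vendor_a`, `vendor_b`, `describe`, `report`,
`demo_adl`) are hypothetical; `vendor_b::describe` stands in for a
declaration that might arrive through a newly added #include.

```cpp
#include <cassert>
#include <string>

namespace app {
	// Generic fallback, found by ordinary unqualified lookup inside app.
	template <typename T>
	std::string describe(const T&) { return "app::describe (generic)"; }
}

namespace vendor_a {
	struct Packet { };
	// No describe() declared alongside this type.
}

namespace vendor_b {
	struct Packet { };
	// Imagine this overload living in a header that was just #included.
	std::string describe(const Packet&) { return "vendor_b::describe (found by ADL)"; }
}

namespace app {
	template <typename T>
	std::string report(const T& value) {
		// Unqualified call: candidates come from ordinary lookup *and*
		// from the namespaces associated with the argument's type.
		return describe(value);
	}
}

void demo_adl() {
	// Nothing in vendor_a matches, so the generic fallback is used.
	assert(app::report(vendor_a::Packet{}) == "app::describe (generic)");
	// vendor_b::describe is a non-template exact match, so it wins
	// overload resolution over the app::describe template.
	assert(app::report(vendor_b::Packet{}) == "vendor_b::describe (found by ADL)");
}
```

Nothing in `app::report` mentions `vendor_b`, yet its behaviour changes
as soon as that overload becomes visible, which is the kind of hijacking
described above.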


> DMD is doing something similar, but at a later stage. For example, in
> the following code snippet: "int a = foo;", "foo" is parsed as an
> identifier expression. Then the semantic analyzer figures out if "foo"
> is a function call or a variable.
[...]

There's no such thing as an 'identifier expression'. `foo` is parsed as
an expression, period.  The parse tree is pretty straightforward.  Of
course, there *is* a hack later that turns it into an implicit function
call, but that's already long past the parsing stage. Unlike the C++
example above, the *structure* of the parse tree doesn't change, just
the meaning of a leaf node.  You don't end up with a completely
unrelated parse tree structure just because of some strange declarations
elsewhere in the source code.


T

-- 
Windows 95 was a joke, and Windows 98 was the punchline.

