Let's stop parser Hell

Sat Jul 7 23:03:21 PDT 2012

On Sunday, 8 July 2012 at 01:15:31 UTC, David Piepgrass wrote:
> I'm not seeing any tremendous error-handling difficulty in my 
> idea. Anyway, I missed the part about information being thrown 
> away...?

(I will use the word "you" to refer to some abstract person or 
compiler.)

Error handling could mean either error reporting and stopping 
after the first error, or reporting several errors and continuing 
parsing + semantic analysis whenever possible, so that the user 
would get partial functionality like code completion, information 
about method overloads, or "go to definition" / find all usages, 
etc.

The first method is the less powerful option, but still requires 
constructing a meaningful error message which could help the user.

There are many possible error recovery strategies. For example, 
you might decide to insert some token according to the parsing 
information available up to the moment when error is discovered, 
if that would fix the problem and allow to continue parsing. 
Another option is to ignore a series of further tokens (treat 
them a whitespace, for example), until parser is able to continue 
its work. Also there are many heuristics for typical error 
situations.

All of these can only be performed if parser gathers the syntax 
information which can be inferred from source code according to 
the grammar rules.

In stage 2 you have only performed some basic analysis, like, 
e.g., matched braces to define some hierarchy. This means that at 
the time when you find a missing brace, for example, you cannot 
tell anything more than that braces don't match. Or, if the user 
inserts a brace in an incorrect location, it is only possible to 
say that it is either missing somewhere and somewhere else 
another brace is missing, or that it is missing in one place, and 
some other brace is redundant. In many cases you won't even 
notice that a brace is incorrectly placed, and pass the resulting 
tree to the 3rd stage. You don't take any hint from grammar about 
the exact locations where some token should be present.

Now, stage 3 heavily depends on the output of stage 2. As I 
demonstrated in some examples, it could get the output which 
implies incorrect structure, even if that has not been found in 
the previous stage. You would need to analyze so much information 
attempting to find the real roots of a problem, that effectively 
it would involve duplicating (implicitly) the logic of previous 
stage, but with combinatorial complexity growth.

The problems you would need to deal with are much more complex 
than I described here. So you wouldn't be able to deliver error 
recovery at all, or (if you could) it would be either trivial or 
would require needlessly complex logic. Breaking the system at 
the wrong boundary (when the number of dependencies that cross 
the boundary is large) always introduces artificial complexity.

Described above is the illustration of what I call information 
loss. I may have described something not as clearly as needed, 
but I didn't have the goal to provide a thorough and verifiable 
analysis. I speculated and simplified a lot. If you decide to 
ignore this, it is not a problem and I don't state that you will 
fail any of your goals. Just admit that __for me__ this approach 
is not a viable option. Everything written above is IMHO and 
there may be many ways to resolve the problems with various 
degrees of success.