Let's stop parser Hell
Roman D. Boiko
rb at d-coding.com
Sat Jul 7 23:03:21 PDT 2012
On Sunday, 8 July 2012 at 01:15:31 UTC, David Piepgrass wrote:
> I'm not seeing any tremendous error-handling difficulty in my
> idea. Anyway, I missed the part about information being thrown
> away...?
(I will use the word "you" to refer to some abstract person or
compiler.)
Error handling could mean either error reporting and stopping
after the first error, or reporting several errors and continuing
parsing + semantic analysis whenever possible, so that the user
would get partial functionality like code completion, information
about method overloads, or "go to definition" / find all usages,
etc.
The first method is the less powerful option, but still requires
constructing a meaningful error message which could help the user.
There are many possible error recovery strategies. For example,
you might decide to insert some token according to the parsing
information available up to the moment when error is discovered,
if that would fix the problem and allow to continue parsing.
Another option is to ignore a series of further tokens (treat
them a whitespace, for example), until parser is able to continue
its work. Also there are many heuristics for typical error
situations.
All of these can only be performed if parser gathers the syntax
information which can be inferred from source code according to
the grammar rules.
In stage 2 you have only performed some basic analysis, like,
e.g., matched braces to define some hierarchy. This means that at
the time when you find a missing brace, for example, you cannot
tell anything more than that braces don't match. Or, if the user
inserts a brace in an incorrect location, it is only possible to
say that it is either missing somewhere and somewhere else
another brace is missing, or that it is missing in one place, and
some other brace is redundant. In many cases you won't even
notice that a brace is incorrectly placed, and pass the resulting
tree to the 3rd stage. You don't take any hint from grammar about
the exact locations where some token should be present.
Now, stage 3 heavily depends on the output of stage 2. As I
demonstrated in some examples, it could get the output which
implies incorrect structure, even if that has not been found in
the previous stage. You would need to analyze so much information
attempting to find the real roots of a problem, that effectively
it would involve duplicating (implicitly) the logic of previous
stage, but with combinatorial complexity growth.
The problems you would need to deal with are much more complex
than I described here. So you wouldn't be able to deliver error
recovery at all, or (if you could) it would be either trivial or
would require needlessly complex logic. Breaking the system at
the wrong boundary (when the number of dependencies that cross
the boundary is large) always introduces artificial complexity.
Described above is the illustration of what I call information
loss. I may have described something not as clearly as needed,
but I didn't have the goal to provide a thorough and verifiable
analysis. I speculated and simplified a lot. If you decide to
ignore this, it is not a problem and I don't state that you will
fail any of your goals. Just admit that __for me__ this approach
is not a viable option. Everything written above is IMHO and
there may be many ways to resolve the problems with various
degrees of success.
More information about the Digitalmars-d
mailing list