dxml behavior after exception: continue parsing

Wed May 9 16:03:29 UTC 2018

On Tuesday, May 08, 2018 16:18:40 Jesse Phillips via Digitalmars-d-learn 
wrote:
> On Monday, 7 May 2018 at 22:24:25 UTC, Jonathan M Davis wrote:
> > I've been considering adding more configuration options where
> > you say something like you don't care if any invalid characters
> > are encountered, in which case, you could cleanly parse past
> > something like an unescaped &, but you'd then potentially be
> > operating on invalid XML without knowing it and could get
> > undesirable results depending on what exactly is wrong with the
> > XML. I haven't decided for sure whether I'm going to add any
> > such configuration options or how fine-grained they'd be, but
> > either way, the current behavior will continue to be the
> > default behavior.
> >
> > - Jonathan M Davis
>
> I'm not going to ask for that (configuration). I may look into
> cloning dxml and changing it to parse the badly formed XML.

Well, for the general case at least, being able to configure the parser to
not care about certain types of validation is the best that I can think of
at the moment for dealing with invalid XML. Selectively skipping invalid
stuff while parsing is a very iffy proposition, especially given that only
one range actually does the validation.
dxml was designed with the idea that it would be operating on valid XML, and
designing a parser to operate on invalid XML can get very tricky - to the
point that it may simply be best for the programmer to design their own
solution tailored to their particular use case if they're going to be
encountering a lot of invalid XML.

If all that's needed is to tell the parser to allow stuff like lone
ampersands, then that's quite straightforward, but if you're dealing with
anything more wrong than that, then things get hairy fast. It's those sorts
of problems that have made HTML parsers so wildly inconsistent in what they
do.
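
To illustrate the "tailored solution" route for the simplest case, here is
a rough sketch of escaping lone ampersands before handing the text to a
strict parser. It is not dxml code and says nothing about dxml's API: it is
Python, with the standard library's xml.etree.ElementTree standing in for a
strict parser, and the helper names (escape_lone_ampersands,
parse_leniently) are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Matches an '&' that does NOT start something shaped like a reference:
# a named entity (&name;), a decimal reference (&#38;), or a hex one (&#x26;).
_LONE_AMP = re.compile(r"&(?!(?:[A-Za-z_:][\w:.\-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_lone_ampersands(text):
    """Replace each bare '&' with '&amp;', leaving existing references alone."""
    return _LONE_AMP.sub("&amp;", text)


def parse_leniently(text):
    """Hypothetical helper: parse strictly first, repair only if that fails."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        # Assumption: the only defect is unescaped ampersands. Anything else
        # still raises ParseError from this second attempt.
        return ET.fromstring(escape_lone_ampersands(text))


if __name__ == "__main__":
    root = parse_leniently("<title>Tom & Jerry</title>")
    print(root.text)
```

Even this small repair has sharp edges. The regex knows nothing about
document structure, so once the fallback runs it will also rewrite
ampersands inside CDATA sections and comments, where a bare & is legal, and
it leaves undeclared entities such as &nbsp; for the strict parser to
reject. Anything more broken than a lone ampersand needs real knowledge of
what the producer of the data got wrong.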

Personally, I think that we'd have all been better off if the various
protocols (particularly those related to the web) had always called for
strict validation and rejected anything that didn't follow the spec.
Instead, we've got this whole idea of "be strict in what you emit but relaxed
in what you accept," and the result is that we've got a lot of incorrect
implementations and a lot of invalid data floating around. And of course, if
you don't accept something and someone else does, then your code is
considered buggy even if it follows the protocol perfectly and the data is
clearly invalid. So, in general, we're all kind of permanently screwed. :(

If I can do reasonable things to make dxml better handle bad data, then I'm
open to it, but given dxml's design, the options are somewhat limited, and
it's just plain a hard problem in general.

- Jonathan M Davis