dxml 0.2.0 released

Tue Feb 13 22:13:36 UTC 2018

On Tue, Feb 13, 2018 at 09:18:12PM +0000, Patrick Schluter via Digitalmars-d-announce wrote:
> On Tuesday, 13 February 2018 at 20:10:59 UTC, Jonathan M Davis wrote:
[...]
> > If it's 100% sure that entity references can be treated as just text
> > and that you can't end up with stuff like start tags or end tags
> > being inserted and messing with the parsing such that they all have
> > to be replaced for the XML to be correctly parsed, then I have no
> > problem passing entity references along, and a higher level parser
> > could try to do something with them, but it's not clear to me at all
> > that an XML document with entity references is correct enough to be
> > parsed while not replacing the entity references with whatever XML
> > markup they contain. I had originally passed them along with the
> > idea that a higher level parser could do something with them, but I
> > decided that I couldn't do that if you could do something like drop
> > a start tag in there and change the meaning of the stuff that needs
> > to be parsed that isn't directly in the entity reference.

This made me go to the W3C spec (https://www.w3.org/TR/xml/) to figure
out what exactly is/isn't defined.  I discovered to my chagrin that XML
entities are a huge rabbit hole with extremely pathological behaviour
that makes them almost impossible to implement in any way that's even
remotely efficient.

Here's a page with examples of how nasty it can get:

	http://www.floriankaeferboeck.at/XML/Comparison.html

Here's an example given in the W3C spec itself:

	<?xml version='1.0'?>
	<!DOCTYPE test [
	<!ELEMENT test (#PCDATA) >
	<!ENTITY % xx '%zz;'>
	<!ENTITY % zz '<!ENTITY tricky "error-prone" >' >
	%xx;
	]>
	<test>This sample shows a &tricky; method.</test>

A correct XML parser is supposed to produce the following text as the
body of the <test>...</test> tag (the grammatical error is intentional):

	This sample shows a error-prone method.

Fortunately, there's a glimmer of hope on the horizon: in section 4.3.2
of the spec (https://www.w3.org/TR/xml/#wf-entities), it is explicitly
stated:

	A consequence of well-formedness in general entities is that the
	logical and physical structures in an XML document are properly
	nested; no start-tag, end-tag, empty-element tag, element,
	comment, processing instruction, character reference, or entity
	reference can begin in one entity and end in another.

Meaning, if I understand it correctly, that you can't have a start tag
in &entity1; and its corresponding end tag in &entity2;, and then have
your document contain "&entity1; &entity2;".  This is because the body
of the entity can only contain text or entire tags (the production
"content" in the spec); an entity that contains an open tag without an
end tag (or vice versa) does not match this rule and is thus illegal.
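
To make the nesting rule concrete, here's a rough Python sketch of how a
higher-level parser might check an entity's replacement text.  The
function name is_balanced_content is made up for illustration, and
wrapping the text in a dummy element is my own shortcut, not something
the spec prescribes.  It also rejects bodies that reference entities
other than the five predefined ones, since the stdlib parser knows
nothing about them.

```python
import xml.etree.ElementTree as ET


def is_balanced_content(replacement_text: str) -> bool:
    """Return True if the text holds only character data and complete elements.

    Hypothetical helper: the text is wrapped in a dummy element so the
    stdlib parser can decide whether it is well-formed element content.
    """
    try:
        ET.fromstring("<wrapper>" + replacement_text + "</wrapper>")
    except ET.ParseError:
        # An unmatched start tag or end tag ends up here.
        return False
    return True


print(is_balanced_content("some <b>bold</b> text"))
print(is_balanced_content("<b>start tag only"))
print(is_balanced_content("end tag only</b>"))
```

An entity body that passes this check can be parsed on its own, without
having to splice it into the surrounding document first.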

So this means that we *can* use dxml as a backend to drive a
DTD-supporting XML parser implementation.  The wrapper / higher-level
parser would scan the slices returned by dxml for entity references, and
substitute them accordingly, which may involve handing the body of the
entity to another instance of dxml to parse any tags that may be nested
in there.
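
As a rough illustration of that substitution step (in Python rather
than D, and not using dxml's actual API), something like the following
could work.  The names expand_entities and ENTITY_REF are hypothetical,
the entity table is assumed to have been built from the DOCTYPE already,
and the name pattern is a simplification of the spec's Name production.

```python
import re

# Simplified pattern for general entity references such as &name;.
# Character references like &#60; do not match and are left alone.
ENTITY_REF = re.compile(r"&([A-Za-z_:][A-Za-z0-9_.:\-]*);")

# Predefined entities are left in place for the low-level parser, so that
# "&lt;" is never turned into markup by accident.
PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}


def expand_entities(text, entities, _active=()):
    """Replace general entity references in text using the entities table."""

    def replace(match):
        name = match.group(1)
        if name in PREDEFINED:
            return match.group(0)
        if name in _active:
            raise ValueError(f"recursive entity reference: &{name};")
        if name not in entities:
            raise KeyError(f"undeclared entity: &{name};")
        # The replacement text may itself contain references.
        return expand_entities(entities[name], entities, _active + (name,))

    return ENTITY_REF.sub(replace, text)


print(expand_entities("This sample shows a &tricky; method.",
                      {"tricky": "error-prone"}))
```

If the expanded text contains tags, the wrapper would hand that piece to
a fresh parser instance instead of treating it as plain character data.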

The nastiness involving partially-formed entity references (as seen in
the above examples) apparently only applies inside the DOCTYPE
declaration, so AIUI this can be handled by the higher-level parser as
part of replacing inline entities with their replacement text.

(The higher-level parser has a pretty tall order to fill, though,
because entities can refer to remote resources via URI, meaning that an
innocuous-looking 5-line XML file can potentially expand to terabytes of
XML tags downloaded from who knows how many external resources
recursively.  Not to mention a bunch of security issues like the ones
described below.)

> There's also the issue that entity references open a whole can of
> worms concerning security. It's quite possible to have an exponentially
> growing entity replacement that can take down any parser.
> 
> <!DOCTYPE root [
>  <!ELEMENT root ANY>
>  <!ENTITY LOL "LOL">
>  <!ENTITY LOL1 "&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;">
>  <!ENTITY LOL2
> "&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;">
>  <!ENTITY LOL3
> "&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;">
>  <!ENTITY LOL4
> "&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;">
>  <!ENTITY LOL5
> "&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;">
>  <!ENTITY LOL6
> "&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;">
>  <!ENTITY LOL7
> "&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;">
>  <!ENTITY LOL8
> "&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;">
>  <!ENTITY LOL9
> "&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;">
> ]>
> <root>&LOL9;</root>
> 
> Hope you have enough memory (this expands to 1 000 000 000 LOLs, i.e.
> 3 000 000 000 characters)
[...]

Yeah, after reading through relevant portions of the spec, I have to say
that full DTD support is a HUGE can of worms.  I tip my hat in
advance to the brave soul (or poor fool :-P) who would attempt to
implement the spec in full. :-D

There are ways to deal with exponential entity growth, e.g., if the
expansion were carried out lazily.  But it's still a DoS vulnerability if
the software then spins practically forever trying to traverse the huge
range of stuff being churned out.
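
Here's a small Python sketch of both halves of that, assuming the entity
table has already been extracted from the DOCTYPE.  All names in it
(build_lol_table, expanded_length, lazy_expand) are made up for the
example, and it assumes no entity refers to itself, which the spec
forbids anyway.  Computing the size up front is my own suggestion for
refusing pathological input before traversing it.

```python
import re
from itertools import islice

ENTITY_REF = re.compile(r"&([A-Za-z_:][A-Za-z0-9_.:\-]*);")


def build_lol_table(levels=9):
    """Rebuild the LOL .. LOL9 entity table from the quoted example."""
    table = {"LOL": "LOL"}
    previous = "LOL"
    for i in range(1, levels + 1):
        name = f"LOL{i}"
        table[name] = f"&{previous};" * 10
        previous = name
    return table


def expanded_length(name, entities, cache=None):
    """Length of an entity's full expansion, computed without building it."""
    cache = {} if cache is None else cache
    if name not in cache:
        body = entities[name]
        literal = len(ENTITY_REF.sub("", body))
        nested = sum(expanded_length(ref, entities, cache)
                     for ref in ENTITY_REF.findall(body))
        cache[name] = literal + nested
    return cache[name]


def lazy_expand(text, entities):
    """Yield the literal chunks of the expansion one at a time."""
    pos = 0
    for match in ENTITY_REF.finditer(text):
        if match.start() > pos:
            yield text[pos:match.start()]
        yield from lazy_expand(entities[match.group(1)], entities)
        pos = match.end()
    if pos < len(text):
        yield text[pos:]


table = build_lol_table()
print(expanded_length("LOL9", table))                    # 3000000000
print("".join(islice(lazy_expand("&LOL9;", table), 3)))  # LOLLOLLOL
```

The lazy version never holds more than one chunk in memory, but a
consumer that walks the whole range still does 10^9 iterations, so a
size limit checked via expanded_length is probably the saner defence.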

Not to mention that having embedded external references is itself a
security issue, particularly since the partial entity formation thing can
be used to obfuscate the real URI of a referenced entity, so you could
potentially trick a remote XML parser into downloading stuff from
questionable sources.  It could be used as a covert surveillance method,
for example, or a malware delivery vector, if combined with an
exploitable bug in the parser code.  Or it could be used to read
sensitive files (e.g., if an entity references file:///etc/passwd or
some such system file).  Ick.
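
For what it's worth, a resolver that wants to be careful about this
could screen every system identifier before fetching anything.  Below is
a minimal Python sketch; the function is_allowed_system_id and the
https-plus-allowlist policy are assumptions of mine, not something the
spec or dxml provides.

```python
from urllib.parse import urlsplit


def is_allowed_system_id(uri, allowed_hosts):
    """Accept only https URIs whose host is on an explicit allowlist."""
    parts = urlsplit(uri)
    return parts.scheme == "https" and parts.hostname in allowed_hosts


allowed = {"www.w3.org"}
print(is_allowed_system_id("https://www.w3.org/TR/xml/", allowed))
print(is_allowed_system_id("file:///etc/passwd", allowed))
print(is_allowed_system_id("http://evil.example/x.dtd", allowed))
```

The check only helps if it runs on the fully assembled URI, i.e., after
all parameter entities have been substituted, since that's exactly where
the obfuscation happens.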

Ironically, the general advice I found online w.r.t. XML vulnerabilities
is "don't allow DTDs", "don't expand entities", "don't resolve
externals", etc.  There also aren't many XML parsers out there that
fully support all the features called for in the spec.  IOW, this
basically amounts to "just use dxml and forget about everything else".
:-D
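
To show what "don't allow DTDs" looks like in practice, here's a small
Python sketch on top of the stdlib expat bindings.  It simply aborts as
soon as a DOCTYPE declaration shows up.  The names DTDForbidden and
parse_without_dtd are made up for the example; the handler attributes
are the ones xml.parsers.expat provides.

```python
import xml.parsers.expat


class DTDForbidden(ValueError):
    """Raised when the document contains a DOCTYPE declaration."""


def parse_without_dtd(data):
    """Parse data and return the element names seen, rejecting any DTD."""
    parser = xml.parsers.expat.ParserCreate()
    names = []

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # Raising here aborts the parse before any entity gets declared.
        raise DTDForbidden(f"DOCTYPE declaration for {name!r} rejected")

    def on_start(name, attributes):
        names.append(name)

    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = on_start
    parser.Parse(data, True)
    return names


print(parse_without_dtd(b"<root><a/></root>"))
try:
    parse_without_dtd(b'<!DOCTYPE root [<!ENTITY LOL "LOL">]><root>&LOL;</root>')
except DTDForbidden as err:
    print("rejected:", err)
```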

Now of course, there *are* valid use cases for DTDs... but a naïve
implementation of the spec is only going to end in tears.  My current
inclination is, just merge dxml into Phobos, then whoever dares
implement DTD support can do so on top of dxml, and shoulder their own
responsibility for vulnerabilities or whatever.  (I mean, seriously,
just for the sake of being able to say "my XML is validated" we have to
implement network access, local filesystem access, a security framework,
and what amounts to a sandbox to control pathological behaviour like
exponentially recursive entities?  And all of this, just to handle rare
corner cases?  That's completely ridiculous.  It's an obvious design
smell to me.  The only thing missing from this poisonous mix is Turing
completeness, which would have made XML a hacker's heaven.  Oh wait, on
further googling, I see that XSLT *is* Turing complete.  Great, just
great.  Now I know why I've always had this gut feeling that
*something* is off about the whole XML mania.)

T

-- 
English is useful because it is a mess. Since English is a mess, it maps well onto the problem space, which is also a mess, which we call reality. Similarly, Perl was designed to be a mess, though in the nicest of all possible ways. -- Larry Wall