Vote: deprecate std.xml?

Jonathan M Davis newsgroup.d at jmdavisprog.com
Fri Jan 17 21:43:13 UTC 2020


On Friday, January 17, 2020 11:54:21 AM MST H. S. Teoh via Digitalmars-d 
wrote:
> On Fri, Jan 17, 2020 at 09:50:52AM -0500, Steven Schveighoffer via Digitalmars-d wrote:
> > On 1/17/20 2:01 AM, Alex Burton wrote:
> > > It works well, and resulted in large performance increases.
> > >
> > > It would be great if something like dxml was in the standard
> > > library.
> >
> > I think the biggest stumbling block was something like schema
> > validation. I can't remember the exact details but Jonathan did not
> > want to include it because it's a security concern. Something in
> > Phobos shouldn't ignore a large part of the standard.
>
> [...]
>
> No, I don't think it was because of security, it was more because of
> performance, because the current implementation of dxml uses slicing
> extensively to avoid needless copying of data. But to validate a schema
> according to spec, esp. some of the more obscure (and convoluted)
> corners of the spec, you'd need to pre-parse the whole thing and
> allocate a bunch of stuff before you can run the validation.
>
> The other stumbling block is entity support, which again has some
> rarely-used corner cases in the spec where they can recursively expand
> to arbitrarily large content (IIRC it may even involve network access or
> at least local filesystem access[*]) that may entirely change the
> meaning of subsequent characters (and resulting parse tree). This would
> make the current slices-based API impossible, which kinda undermines
> dxml's entire underlying premise.
>
> ([*] Yeah, the XML spec is IMNSHO the epitome of design by committee
> producing an insanely-overengineered over-complex system, most features
> of which normal people never use or are even aware of.)

Basically, you're both right. Security was part of the problem, but it
wasn't the core problem. Honestly, everything involved with DOCTYPE was a
terrible idea, and if the people who thought it up haven't come to that
conclusion in the interim, they should do some serious soul searching. E.g.,
how on earth did anyone think that it was a good idea for a document to tell
an application what constituted a valid document? That's total nonsense.
It's up to the application to determine whether its input is valid, whereas
DOCTYPE basically makes it the input's job to tell the application whether
the input is valid. How did anyone think that that was a good idea? And
adding what is essentially a #include and macro system to a document format?
How on earth is _that_ a good idea? The DOCTYPE section adds a ton of
complexity to the XML spec and any XML parser that would attempt to fully
implement it, and it's the sort of thing that no one should have anything to
do with unless they have no choice (which unfortunately is probably the case
for some people).

Both the security concern and the chief reason that dxml does not support
the DOCTYPE section beyond parsing past it have to do with entity
references. The DOCTYPE section can not only define entity references, but
it can point to other documents which then have to be parsed in order to
find the definitions of entity references. Those entity references then get
replaced with more or less arbitrary chunks of XML based on their
definitions. Basically, it's the equivalent of #including files to access
macros that are #defined in those files. The fact that you have to worry
about going and parsing another document makes it impossible to simply parse
an arbitrary XML document. Suddenly, the parser has to care about where the
XML file is on disk (so that it can correctly follow any file paths), and it
potentially has to download documents from the internet (since arbitrary
URLs can be provided) - which is a big security concern. And regardless of
whether the entity references are defined in the current document or a
separate document, the fact that they can insert more or less arbitrary XML
destroys your ability to have the output simply be slices of the input.
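
To make the slicing problem concrete, here is a minimal sketch in D. It is
not dxml code: the document, the entity name "who", and the helpers
expandEntity and isSliceOf are all made up for illustration, and it assumes
that a DOCTYPE section defined "who" as "World". It shows that the
unexpanded text is a slice of the input, whereas the expanded text has to
live in newly allocated memory.

```d
import std.array : replace;

// Hypothetical helper, not part of dxml: substitutes one entity reference
// with the replacement text that a DOCTYPE section would have defined.
string expandEntity(string text, string name, string value)
{
    return text.replace("&" ~ name ~ ";", value);
}

// True if `part` points into the memory owned by `whole`.
bool isSliceOf(string part, string whole)
{
    return part.ptr >= whole.ptr
        && part.ptr + part.length <= whole.ptr + whole.length;
}

void main()
{
    // Assumed input; pretend the DOCTYPE defined &who; as "World".
    string input = "<greeting>Hello &who;</greeting>";

    // Without expansion, the element's text is just a slice of the input.
    string raw = input[10 .. 21];
    assert(raw == "Hello &who;");
    assert(isSliceOf(raw, input));

    // With expansion, the resulting text does not exist anywhere in the
    // input, so it cannot be returned as a slice of it.
    string expanded = expandEntity(raw, "who", "World");
    assert(expanded == "Hello World");
    assert(!isSliceOf(expanded, input));
}
```

A real entity definition can contain markup as well as text, so the effect
on a parser is larger than this single substitution suggests.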

One of the core design goals of dxml was that the output type be either the
same as the input type or that it be a TakeExactly of the input type. That
way, if you give it a string, you get strings back. It's very efficient that
way, and it's way more user friendly. I did not want wrappers coming out the
other side, because that would pretty much inevitably result in additional
memory allocations occurring just to get strings again. And if you're
potentially inserting arbitrary text into the middle of your string, you
can't just return a slice. So, while I had originally tried to support the
DOCTYPE section in spite of thinking that it's a terrible, terrible idea
that it even exists, once I figured out that I couldn't do it while
returning slices, I dropped all of my code that was trying to deal with the
DOCTYPE section, and I will never make dxml support it. It would be making
the parser far worse for the common use case just to support the rare use
case (or what certainly _should_ be a rare use case).
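
As a rough sketch of that design goal (the template name SliceOf is
hypothetical and this is not presented as dxml's actual source), the output
type can be chosen at compile time: arrays and sliceable ranges come back
as their own type, and anything else comes back as whatever
std.range.takeExactly returns for it.

```d
import std.algorithm.iteration : filter;
import std.range : takeExactly;
import std.range.primitives : hasSlicing;
import std.traits : isDynamicArray;

// Hypothetical name: picks the type handed back for a piece of the input.
template SliceOf(R)
{
    static if (isDynamicArray!R || hasSlicing!R)
        alias SliceOf = R; // give it a string, get strings back
    else
        alias SliceOf = typeof(takeExactly(R.init, 42)); // wrapper range
}

// A string stays a string.
static assert(is(SliceOf!string == string));

// A range without slicing (here a filtered string) gets the takeExactly
// wrapper instead, which is a different type from the input range.
alias Filtered = typeof(filter!"a != ' '"("a b"));
static assert(!is(SliceOf!Filtered == Filtered));

void main() {}
```

The count passed to takeExactly only matters for type deduction here; the
expression inside typeof is never evaluated.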

> The ironic thing is that the cases that dxml *does* support are the only
> cases that 99% of XML users would ever actually need. Yet there's that
> annoying 1% of obscure and insanely-complex corner in the spec that
> *some* people out there actually expect to work, which prevents us from
> saying that dxml implements the entire XML spec.  And Phobos being the
> epitome of perfectionism, this means dxml will likely never make it in.
> Or if it does, it's almost guaranteed that *somebody* will barge in and
> complain loudly about how std.dxml doesn't *actually* implement the XML
> spec.

Yeah. std.xml doesn't support the DOCTYPE stuff either, but I'm sure that if
I went through the Phobos review process with dxml, there would be some
people screaming that if it's in the standard library, it must support the
entirety of the spec. If some poor soul wants to actually implement a parser
in D that does that, then all the more power to them, but the result is
bound to be worse than dxml for those of us who want to parse XML documents
that don't use DOCTYPE-specific features - which is almost certainly most of
us.

- Jonathan M Davis




