dxml 0.2.0 released

Jonathan M Davis newsgroup.d at jmdavisprog.com
Mon Feb 12 16:50:16 UTC 2018


On Monday, February 12, 2018 07:59:24 H. S. Teoh via Digitalmars-d-announce 
wrote:
> On Mon, Feb 12, 2018 at 07:04:38AM -0700, Jonathan M Davis via
> Digitalmars-d-announce wrote: [...]
>
> > However, if folks as a whole think that Phobos' xml parser needs to
> > support the DTD section to be acceptable, then dxml won't replace
> > std.xml, because dxml is not going to implement DTD support. DTD
> > support fundamentally does not fit in with dxml's design.
>
> Actually, thinking about this, I'm wondering if a combination of
> preprocessing and/or postprocessing might make it possible to implement
> DTD support without needing to rewrite the guts of dxml. AIUI, dxml does
> parse the DTD section correctly, i.e., as an XML directive, but simply
> doesn't look into its internal details. So one way to implement DTD
> support might be:
>
> - Write an auxiliary parser that's basically a wrapper around dxml,
>   forwarding XML events to the caller, except:
> - If a DTD event is encountered, eagerly parse it, store DTD
>   declarations internally for future reference.
> - If there's a DTD that has been seen, perform on-the-fly validation as
>   XML events are forwarded.
> - In PCDATA sections, if there are entity references to the DTD, expand
>   them, possibly inserting more XML events into the stream based on
>   what's defined in the DTD. (This may need to reuse some dxml internals
>   to parse XML snippets that might be contained in an entity definition,
>   for example.)

The core problem is that entity references get replaced with more XML that
needs to be parsed. So, they can't simply be passed on for post-processing.
As I understand it, they have to be replaced while the parsing is going on.
And that means that you can't do something like return slices of the
original input that don't bother with the entity references and then have a
separate parser take that and process it further to deal with the entity
references. The first parser has to deal with them, and that means not
returning slices of the original input unless you're dealing purely with
strings and are willing to allocate new strings in the cases where the data
needs to be mutated because of an entity reference.
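
To make the slicing problem concrete, here is a minimal sketch in Python
(used purely for illustration; this is not dxml's API). The ENTITIES table
and the text_span helper are hypothetical stand-ins for whatever a
DTD-aware parser would build. A "slice" is modeled as a pair of offsets into
the original input. The sketch also ignores the harder part, which is that
replacement text can itself contain markup that needs to be parsed.

```python
import re

# Hypothetical entity table, standing in for declarations read from a DTD.
ENTITIES = {"company": "Example Corp"}

# Matches named entity references such as &company; (not character references).
_ENTITY_REF = re.compile(r"&(\w+);")


def text_span(source, start, end):
    """Describe the text in source[start:end].

    Without entity references the result is ("slice", start, end): a pair of
    offsets into the original input, so nothing new has to be built.
    With entity references the result is ("owned", text): a newly built
    string, because the expanded text does not exist anywhere in the input.
    An undeclared entity raises KeyError in this sketch.
    """
    raw = source[start:end]
    if _ENTITY_REF.search(raw) is None:
        return ("slice", start, end)
    expanded = _ENTITY_REF.sub(lambda m: ENTITIES[m.group(1)], raw)
    return ("owned", expanded)


doc = "<a>plain text</a><b>made by &company;</b>"

start = doc.index("plain text")
plain = text_span(doc, start, start + len("plain text"))

start = doc.index("made by")
mixed = text_span(doc, start, start + len("made by &company;"))
```

Here plain can still be described by offsets into doc, whereas mixed cannot,
which is the case that forces an allocation.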

If we were going to stick to strings and only strings, it would be quite
possible to define the API in a way that it may or may not do DTD
processing, but that doesn't work with arbitrary ranges of characters, not
unless you give up on returning slices of the original input, and that means
harming the performance and usability for the common case in order to
support DTDs.

Also, anything that has the concept of "events" would be drastically
different from what dxml does. dxml is completely range-based. It has no
callbacks or anything of the sort, and having anything like that would
complicate it considerably.
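
As a rough illustration of the difference (again in Python, and not dxml's
API), the sketch below contrasts a pull-style interface, where the caller
drives iteration much as it would with a range, with a callback-style
interface, where the parser drives. The names pull_tokens and push_tokens
are hypothetical, and the toy tokenizer only understands bare tags without
attributes.

```python
import re

# Toy grammar: <name>, </name>, or a run of text. No attributes, no comments.
_TOKEN = re.compile(r"<(/?)(\w+)>|([^<]+)")


def pull_tokens(source):
    """Lazily yield (kind, value) pairs; the caller decides when to advance."""
    for match in _TOKEN.finditer(source):
        closing, name, text = match.groups()
        if name is not None:
            yield ("end" if closing else "start", name)
        else:
            yield ("text", text)


def push_tokens(source, on_start, on_end, on_text):
    """Callback style: the parser runs to completion and calls the handlers."""
    handlers = {"start": on_start, "end": on_end, "text": on_text}
    for kind, value in pull_tokens(source):
        handlers[kind](value)


# Pull: take only as much as is needed, then stop.
tokens = pull_tokens("<a>hi</a>")
first = next(tokens)

# Push: the handlers are invoked for every token in the input.
seen = []
push_tokens("<a>hi</a>", seen.append, seen.append, seen.append)
```

With the pull style, the caller can stop after the first token and nothing
else gets parsed. With the callback style, control stays with the parser
until the input is exhausted.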

There are lots of interesting things that could be done to try to deal with
the DTD section, but they fundamentally don't work with returning slices of
the original input unless you're only using strings.

In any case, I refuse to change dxml so that it has DTD support, and I
refuse to change it so that it doesn't return slices of the original input.
If I were to do so, it would make the parser worse for any use case I care
about and require a lot of time and effort on my part that I'm not willing
to spend. So, if that makes it so that dxml is never included in Phobos,
then so be it.

Folks are free to support dxml for inclusion when the time comes and free
to vote against it as unacceptable. Personally, I think that dxml's
approach is ideal for XML that doesn't use entity references, and I'd much
rather use that kind of parser regardless of whether it's in the standard
library or not. I think that the D community would be far better off with
std.xml being replaced by dxml, but whatever happens happens. I'd be just as
fine with a decision to remove std.xml and not include dxml. I'm less fine
with std.xml being left in Phobos and dxml being rejected, because std.xml
has been recognized as bad, and it sure doesn't look like anyone else is
going to write a replacement any time soon. I also think that dxml's
approach is better for the common case than anything that supported DTDs
would be, so I think that having dxml's solution in Phobos would be better
for the community even if Phobos also had a solution that supported DTDs,
but at this point, it looks like the options are going to be

1. std.xml stays and continues to suck.
2. std.xml gets ripped out and dxml replaces it.
3. std.xml gets ripped out and we have no XML solution in Phobos.

But as it stands, it doesn't seem likely that any XML solution that supports
DTDs will make it into Phobos any time soon, if ever, because
AFAIK, only three people have put in any real effort towards replacing
std.xml since 2010 or whenever it was that we decided it needed to be
replaced. The first two people both disappeared into oblivion without ever
finishing, and here I am with a working StAX parser (now with DOM support)
and an XML writer in the works - and given how involved I am with D, I think
that it's pretty unlikely that I'm disappearing anywhere short of getting
hit by a bus or whatnot. So, at least I've actually put in the time and
effort towards a solution and made it available, and it will almost
certainly be an essentially complete solution by the time that DConf rolls
around if not well before.

So, I do expect that the question of Phobos inclusion will ultimately be a
question of whether std.xml _ever_ gets replaced, but regardless, at least
there is a solution, and it will continue to be available as a third-party
library even if it never makes it into Phobos.

- Jonathan M Davis


