dxml 0.2.0 released

Mon Feb 12 21:51:56 UTC 2018

On Mon, Feb 12, 2018 at 09:50:16AM -0700, Jonathan M Davis via Digitalmars-d-announce wrote:
[...]
> The core problem is that entity references get replaced with more XML
> that needs to be parsed. So, they can't simply be passed on for
> post-processing.  As I understand it, they have to be replaced while
> the parsing is going on.  And that means that you can't do something
> like return slices of the original input that don't bother with the
> entity references and then have a separate parser take that and
> process it further to deal with the entity references. The first
> parser has to deal with them, and that means not returning slices of
> the original input unless you're dealing purely with strings and are
> willing to allocate new strings in the cases where the data needs to
> be mutated because of an entity reference.
[...]

I think you missed my point.

What I'm trying to say is, given the current functionality of dxml, one
*can* build an XML interface that implements DTD support.

Of course, some concessions obviously have to be made, such as needing
to allocate memory (I don't see how else one could keep a dictionary of
DTD rules / entity declarations, for example), or not being able to
return only slices of the input anymore.  For example, entity
support pretty much means plain slices are no longer an option, because
you have to perform substitution of entity definitions, so you'll have
to either wrap it in some kind of lazy range that chains the entity
definition to the surrounding text, or you'll have to use strings or
something else.  Which means you'll need to have memory allocation /
slower parsing / whatever, but that's the price of DTD support.
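
To make the "lazy range that chains the entity definition to the
surrounding text" idea concrete, here is a rough sketch.  It's in Python
only to keep it short; none of this is dxml API.  The ENTITIES table and
expand_lazily are made-up names, and only named references with plain-text
replacements are handled (no character references, no markup inside the
replacement text).

```python
import re

# Hypothetical entity table, as a DTD-aware wrapper might build it.
ENTITIES = {"author": "Jane Doe", "amp": "&"}

# Matches a named entity reference such as &author;
_REF = re.compile(r"&([A-Za-z_][\w.-]*);")


def expand_lazily(text, entities):
    """Lazily yield the pieces of `text` with entity references replaced.

    Untouched runs are yielded as substrings of the input and the
    replacement text is yielded as-is; the pieces are never joined
    into one new string here.
    """
    pos = 0
    for m in _REF.finditer(text):
        if m.start() > pos:
            yield text[pos:m.start()]      # text before the reference
        name = m.group(1)
        if name not in entities:
            raise KeyError("undeclared entity: " + name)
        yield entities[name]               # the entity's replacement text
        pos = m.end()
    if pos < len(text):
        yield text[pos:]                   # trailing text after the last reference


pieces = list(expand_lazily("by &author; et al.", ENTITIES))
```

In D the untouched pieces could be real slices of the input and the whole
thing a chained range; either way the caller no longer gets a single
slice back, which is the concession described above.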

But again, the point is, basic XML parsing (without DTD support) doesn't
*need* to pay this price. What's currently in dxml doesn't need to
change. DTD support can be implemented in a submodule / separate module
that wraps around dxml and builds DTD support on top of it.

Put another way, we can implement DTD support *on top of* dxml this way:
- Parse the XML using dxml as an initial step (this can be done lazily,
  or semi-lazily, as needed).
- As an intermediate step, parse the DTD section, construct whatever
  internal state is needed to handle DTD rules, a dictionary of entity
  references, etc.
- Filter the output of dxml to insert whatever extra behaviour is needed
  to implement DTD support before handing it to the calling code, e.g.,
  expand entity references, or implement validation and throw an
  exception if validation fails, etc.
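
The three steps above could look roughly like the sketch below.  It's
Python rather than D, and it assumes a made-up event stream of
(kind, payload) tuples standing in for whatever the underlying parser
yields; whether that parser hands over the raw DTD text at all is also an
assumption.  Only internal <!ENTITY name "value"> declarations with
plain-text values are handled, with no nesting and no validation.

```python
import re

# Internal general entity declarations of the form <!ENTITY name "value">
_DECL = re.compile(r'<!ENTITY\s+([\w.-]+)\s+"([^"]*)"\s*>')
# Named entity references such as &who;
_REF = re.compile(r"&([\w.-]+);")


def with_dtd(events):
    """Filter (kind, payload) events from a DTD-unaware parser.

    `events` is a hypothetical stand-in for the base parser's output.
    """
    # The five entities predefined by XML are always available.
    entities = {"lt": "<", "gt": ">", "amp": "&", "apos": "'", "quot": '"'}
    for kind, payload in events:          # step 1: consume the base parser lazily
        if kind == "doctype":
            # Step 2: build the entity dictionary from the DTD text.
            entities.update(_DECL.findall(payload))
            continue                      # the DTD itself is not passed on
        if kind == "text":
            # Step 3: expand references before the caller sees the text.
            # An undeclared entity raises KeyError.
            payload = _REF.sub(lambda m: entities[m.group(1)], payload)
        yield kind, payload


events = [
    ("doctype", '<!DOCTYPE r [<!ENTITY who "world">]>'),
    ("start", "r"),
    ("text", "hello &who; &amp; co"),
    ("end", "r"),
]
result = list(with_dtd(events))
```

The base parser's output is only consumed and filtered here; nothing
about the base parser itself has to change.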

*We don't need to change dxml's current API at all.*

At the most, I anticipate that the only potential change needed is to
expose an interface to parse XML fragments (i.e., not a complete XML
document wrapped in a single root element, but just some PCDATA that may
contain entities or tags) so that the DTD support wrapper can use it to
expand entities and insert any tags that may appear inside the entity
definition.

The DTD wrapper doesn't guarantee (and doesn't need to!) that it returns
slices of the input like dxml does. I don't see that as a problem, since
I can't see how anyone would be able to implement full DTD support with
only slices, even independently from the way dxml is implemented right
now.

We can even design the DTD support wrapper to start with being just a
thin wrapper around dxml, and lazily switch to full DTD mode only if a
DTD section is encountered.  Then user code that doesn't care to use
dxml's raw API won't even need to care about the difference.
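
A rough sketch of that "thin until a DTD shows up" behaviour, again in
Python with the same made-up (kind, payload) events.  Both
passthrough_until_dtd and the dtd_filter argument are hypothetical names,
not anything dxml provides.

```python
from itertools import chain


def passthrough_until_dtd(events, dtd_filter):
    """Yield events untouched until a "doctype" event appears.

    From that point on, the doctype event and everything after it are
    handed to `dtd_filter`, a callable taking an event iterable and
    returning an event iterable.
    """
    it = iter(events)
    for ev in it:
        if ev[0] == "doctype":
            # Switch to full DTD mode for the remainder of the stream.
            yield from dtd_filter(chain([ev], it))
            return
        yield ev                          # thin mode: pass through as-is


def shout(evs):
    # Toy stand-in for a real DTD-aware filter: uppercases payloads.
    for kind, payload in evs:
        yield kind, payload.upper()


no_dtd = list(passthrough_until_dtd([("start", "r"), ("text", "a")], shout))
has_dtd = list(passthrough_until_dtd(
    [("comment", "a"), ("doctype", "d"), ("text", "b")], shout))
```

For documents with no DTD the filter is never invoked, so the only extra
cost is the check on each event's kind.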

T

-- 
Curiosity kills the cat. Moral: don't be the cat.