dxml 0.2.0 released

Mon Feb 12 22:18:33 UTC 2018

On Monday, February 12, 2018 13:51:56 H. S. Teoh via Digitalmars-d-announce 
wrote:
> For example, entity
> support pretty much means plain slices are no longer an option, because
> you have to perform substitution of entity definitions, so you'll have
> to either wrap it in some kind of lazy range that chains the entity
> definition to the surrounding text, or you'll have to use strings or
> something else.  Which means you'll need to have memory allocation /
> slower parsing / whatever, but that's the price of DTD support.

Which was my point. The API as-is doesn't work with DTD support for those
very reasons.

> But again, the point is, basic XML parsing (without DTD support) doesn't
> *need* to pay this price. What's currently in dxml doesn't need to
> change. DTD support can be implemented in a submodule / separate module
> that wraps around dxml and builds DTD support on top of it.
>
> Put another way, we can implement DTD support *on top of* dxml this way:
> - Parse the XML using dxml as an initial step (this can be done lazily,
>   or semi-lazily, as needed).
> - As an intermediate step, parse the DTD section, construct whatever
>   internal state is needed to handle DTD rules, a dictionary of entity
>   references, etc.
> - Filter the output of dxml to insert whatever extra behaviour is needed
>   to implement DTD support before handing it to the calling code, e.g.,
>   expand entity references, or implement validation and throw an
>   exception if validation fails, etc.
>
> *We don't need to change dxml's current API at all.*

I don't think that this works, because the entity references insert new XML
and thus affect the parsing. And as such, you can't simply pass through the
entity references to be processed by another parser. They need to be handled
by the core parser, otherwise it's going to give incorrect results, not just
results that need further parsing. I'm sure that dxml's internals could be
refactored so that they could be shared with another parser that did that,
but unless I'm misunderstanding how entity references work, you can't use
what's there now as-is and build another parser on top of it. The entity
reference replacement needs to happen in the core parser.
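
To illustrate the point about entity references inserting new XML, here is a
minimal sketch. It is in Python rather than D and uses the standard
library's xml.etree.ElementTree (backed by expat, which expands entities
declared in the internal DTD subset). The document and the entity name
"frag" are made up for the example. It is not meant to show how dxml or any
DTD-aware D parser would be written.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the entity "frag" is declared in the internal DTD
# subset, and its replacement text contains markup, not just characters.
DOC = """<!DOCTYPE root [
  <!ENTITY frag "<child>text</child>">
]>
<root>&frag;</root>"""

root = ET.fromstring(DOC)

# A parser that expands the entity sees <root> as containing an element.
# A parser that passes "&frag;" through unexpanded would instead report
# plain text inside <root>, which is a different parse tree, not just one
# that needs a later substitution pass.
child_tags = [child.tag for child in root]
print(child_tags)      # tags of the elements produced by the expansion
print(root[0].text)    # character data inside the expanded element
```

Because the replacement text contributes start and end tags, the expansion
has to happen where the tags are being recognized.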

> The DTD wrapper doesn't guarantee (and doesn't need to!) to return
> slices of the input like dxml does. I don't see that as a problem, since
> I can't see how anyone would be able to implement full DTD support with
> only slices, even independently from the way dxml is implemented right
> now.

Yeah, if I were writing a parser that handled the DTD section, I wouldn't
make it deal with slices of the input like dxml does. The exception would be
if I decided to make it always return string, in which case you could get
slices of the original input for strings but not for other range types. The
alternative is a lazy range, which would be worse if you passed strings but
better for other range types. That trade-off is the main reason that I gave
up on having dxml handle the DTD section. I consider either compromise
unacceptable for dxml, since one of its key goals was to provide slices of
the input rather than lazy ranges or newly allocated strings.

In any case, unless I misunderstand how entity references work, that would
have to be its own parser and not simply a wrapper around dxml because of
how the entity references affect the parsing. If I'm wrong, then great,
someone else can come along later and add some sort of DTD parser on top of
dxml, and if I'm right, well, then anyone who wants to do anything like that
is going to need to write a new parser, but that can then coexist alongside
dxml's parser just fine. Either way, I like dxml's approach and don't want
to compromise what it's doing in an attempt to fully deal with DTDs.

- Jonathan M Davis