dxml 0.2.0 released

H. S. Teoh hsteoh at quickfur.ath.cx
Tue Feb 13 22:29:27 UTC 2018


On Tue, Feb 13, 2018 at 03:00:59PM -0700, Jonathan M Davis via Digitalmars-d-announce wrote:
[...]
> The big problem is how the entity references affect the parsing. If
> start tags can be dropped in and affect the parsing (and it's still
> not clear to me from the spec whether that's legal - there is a
> section talking about being nested properly which might indicate that
> that's not legal, but it's not very specific or clear), and if it's
> legal to do something like use an entity reference for a tag name -
> e.g. <&foo;>, then that's a serious problem. And problems like that
> are the main reason why I completely dropped any attempt to do
> anything with the DTD section.

AFAICT, section 4.3.2 in the spec (probably the one you're referring to)
seems to be saying that you can't do that:

	A consequence of well-formedness in general entities is that the
	logical and physical structures in an XML document are properly
	nested; no start-tag, end-tag, empty-element tag, element,
	comment, processing instruction, character reference, or entity
	reference can begin in one entity and end in another.
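
To make that rule concrete, here is a minimal Python sketch of a check that
an entity's replacement text is properly nested on its own. It is only an
illustration: the function name and the dummy <wrapper> element are made up,
it uses Python's standard xml.etree parser rather than dxml, and it assumes
the body contains no references to DTD-defined entities (the stdlib parser
rejects those as undefined).

```python
import xml.etree.ElementTree as ET


def entity_body_is_self_contained(body: str) -> bool:
    """Return True if `body` parses as well-formed content by itself.

    The body is wrapped in a hypothetical dummy root element; any tag,
    comment, or PI that starts inside the body but does not end inside
    it (or vice versa) makes the wrapped document ill-formed.
    """
    wrapped = "<wrapper>" + body + "</wrapper>"
    try:
        ET.fromstring(wrapped)
    except ET.ParseError:
        return False
    return True
```

A body like "<b>bold</b> text" passes, while "<b>bold" or a lone "</b>" is
rejected, which is the property that would let an entity body be handed to
a parser separately from the rest of the document.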


> If entity references are only legal in the text between start and end
> tags and between the quotes of attribute values, and whatever they're
> replaced with cannot actually affect anything else in the XML document
> (i.e. it can't just be a start or end tag or anything like that - it
> has to be fully parseable on its own and not affect the parsing of
> the document itself), then passing them along should be fine.

That's the approach I'm thinking of.


[...]
> Regardless, there's no risk of dxml's parser ever being changed to
> actually replace entity references. That doesn't work with returning
> slices of the original input, and it really doesn't work with a parser
> that's just supposed to take a range of characters and parse it. To
> fully handle all of the DTD stuff means actually reading files from
> disk or from the internet - which of course is where the security
> problems come in, but it also means that you're not just dealing with
> a parser anymore. In principle, dxml's parser should be pure (though
> some implementation details make it so that it isn't right now), whereas an
> XML parser that fully handles the DTD section could never be pure.
[...]

Given the insane complexities of DTD that I'm only slowly beginning to
grasp from actually reading the spec, I'm quickly adopting the opinion
that dxml should remain as-is, and any DTD implementation should be
layered on top.  The only potential changes that might be needed are:

- provide a way to parse XML snippets that don't have a <?xml ...>
  declaration, so that a DTD implementation could, for example, hand an
  entity body over to dxml to extract any tags that may be nested in
  there (and if my reading of section 4.3.2 is correct, all such tags
  must always be closed inside the entity body, so there should be no
  errors produced).

- provide some way of hooking into non-default entities so that
  DTD-defined entities can be expanded by the DTD implementation.  This
  could be as simple as leaving such entities untouched in the returned
  range, or inventing a special EntityType representing such entities
  (with a slice of the input containing the entity name) so that the DTD
  implementation can insert the replacement text.
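
A rough Python sketch of the second idea, purely to illustrate the shape of
the interface. Nothing here is dxml's API: the token kinds "text" and
"entity_ref", the function names, and the resolver callback are all
hypothetical. The five predefined XML entities are left inside the text, and
every other reference is surfaced as its own token carrying only the entity
name, so that a separate DTD layer can decide what to substitute.

```python
import re
from typing import Callable, Iterator, Tuple

# The five entities predefined by the XML spec; these stay in the text.
PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}

# Simplified ASCII-only approximation of an XML Name inside "&...;".
_ENTITY_REF = re.compile(r"&([A-Za-z_:][A-Za-z0-9_.:\-]*);")


def split_entities(text: str) -> Iterator[Tuple[str, str]]:
    """Yield ("text", slice) and ("entity_ref", name) tokens in order."""
    pos = 0
    for match in _ENTITY_REF.finditer(text):
        name = match.group(1)
        if name in PREDEFINED:
            continue  # left untouched as part of the surrounding text
        if match.start() > pos:
            yield ("text", text[pos:match.start()])
        yield ("entity_ref", name)
        pos = match.end()
    if pos < len(text):
        yield ("text", text[pos:])


def expand_with(text: str, resolve: Callable[[str], str]) -> str:
    """What a DTD layer might do: splice in its own replacement text."""
    return "".join(
        resolve(value) if kind == "entity_ref" else value
        for kind, value in split_entities(text)
    )
```

For example, with a resolver built from a dict such as {"foo": "FOO"},
expand_with("a &amp; b &foo; c", ...) replaces only &foo; and leaves &amp;
for whatever normalisation step handles the predefined entities.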

Everything else should be handled by the DTD layer, e.g., parsing the
DOCTYPE section (which is itself pretty pathological, given the actual
examples in the W3C spec to this effect), expanding entities, looking up
external entities, limiting recursive entity expansion, implementing a
security model, etc.
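
As one example of what such a layer would have to do, here is a small
Python sketch of limiting recursive entity expansion. The names
(expand_entities, EntityExpansionError) and the default limits are
assumptions picked for illustration, not taken from any spec or library. It
refuses self-referential entities, caps the nesting depth, and caps the size
of every expanded string, which is what stops "billion laughs" style inputs.

```python
import re

# Simplified ASCII-only approximation of an XML Name inside "&...;".
_REF = re.compile(r"&([A-Za-z_:][A-Za-z0-9_.:\-]*);")


class EntityExpansionError(Exception):
    """Raised when expansion is recursive, too deep, or too large."""


def expand_entities(text, entities, max_depth=8, max_chars=10_000, _active=()):
    """Expand references to `entities` (name -> replacement text).

    Unknown names (e.g. the predefined &amp;) are left untouched.
    `_active` tracks the entities currently being expanded.
    """
    def replace(match):
        name = match.group(1)
        if name not in entities:
            return match.group(0)
        if name in _active:
            raise EntityExpansionError("recursive reference to &%s;" % name)
        if len(_active) >= max_depth:
            raise EntityExpansionError("entity nesting too deep")
        return expand_entities(
            entities[name], entities, max_depth, max_chars, _active + (name,)
        )

    result = _REF.sub(replace, text)
    # Checked at every nesting level, so growth is caught early.
    if len(result) > max_chars:
        raise EntityExpansionError("expanded text too large")
    return result
```

Because the size check runs at each level of nesting, an exponentially
growing chain of entities is rejected as soon as one intermediate result
exceeds the cap, rather than after the whole thing has been built.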


T

-- 
Why do conspiracy theories always come from the same people??

