dxml 0.1.0 released

Fri Feb 9 21:47:52 UTC 2018

On Fri, Feb 09, 2018 at 02:15:33PM -0700, Jonathan M Davis via Digitalmars-d-announce wrote:
> I have multiple projects that need an XML parser, and
> std_experimental_xml is clearly going nowhere, with the guy who wrote
> it having disappeared into the ether, so I decided to break down and
> write one. I've kind of wanted to for years, but I didn't want to
> spend the time on it. However, sometime last year I finally decided
> that I had to, and it's been what I've been working on in my free time
> for a while now. And it's finally reached the point when it makes
> sense to release it - hence this post.

Hooray!  Finally, a glimmer of hope for XML parsing in D!

> Currently, dxml contains only a range-based StAX / pull parser and related
> helper functions, but the plan is to add a DOM parser as well as two writers
> - one which is the writer equivalent of a StaX parser, and one which is
> DOM-based. However, in theory, the StAX parser is complete and quite useable
> as-is - though I expect that I'll be adding more helper functions to make it
> easier to use, and if you find that you're doing a particular operation with
> it frequently and that that operation is overly verbose, please point it out
> so that maybe a helper function can be added to improve that use case - e.g.
> I'm thinking of adding a function similar to std.getopt.getopt for handling
> attributes, because I personally find that dealing with those is more
> verbose than I'd like. Obviously, some stuff is just going to do better with
> a DOM parser, but thus far, I've found that a StAX parser has suited my
> needs quite well. I have no plans to add a SAX parser, since as far as I can
> tell, SAX parsers are just plain worse than StAX parsers, and the StAX
> approach is quite well-suited to ranges.
> 
> Of note, dxml does not support the DTD section beyond what is required to
> parse past it, since supporting it would make it impossible for the parser
> to return slices of the original input beyond the case where strings are
> used (and it would be forced to allocate strings in some cases, whereas dxml
> does _very_ minimal heap allocation right now), and parsing the DTD section
> signicantly increases the complexity of the parser in order to support
> something that I honestly don't think should ever have been part of the XML
> standard and is unnecessary for many, many XML documents. So, if you're
> dealing with XML documents that contain entity references that are declared
> in the DTD section and then used outside of the DTD section, then dxml will
> not support them, but it will work just fine if a DTD section is there so
> long as it doesn't declare any entity references that are then referenced in
> the document proper.
> 
> Hopefully, the documentation is clear enough, but obviously, I'm not
> the best judge of that. So, have at it.
> 
> Documentation: http://jmdavisprog.com/docs/dxml/0.1.0/
> Github: https://github.com/jmdavis/dxml
> Dub: http://code.dlang.org/packages/dxml
[...]

Wonderful!  The docs are beautiful, I must say.  Good job on that.
Though a simple example of basic usage in the module header would be
very nice.

Glanced over the docs.  It's a pretty nice and clean API, and IMO,
worthy of consideration to be included into Phobos.  IMO, the lack of
SAX / DOM parsing is not a big deal, since it's not hard to build one
given StAX primitives.

Being range-based is very nice, but I'd say your choice to slice the
input, defer expensive/allocating operations to normalize() is a big
winning point.  This approach is fundamental to high performance, in the
principle of not doing any operation that isn't strictly necessary until
it's actually asked for.  If nothing else, this is a good design pattern
that I plan to st^Wcopy in my own code. :-P

As for DTDs, perhaps it might be enough to make normalize() configurable
with some way to specify additional entities that may be defined in the
DTD?  Once that's possible, I'd say it's Good Enough(tm), since the user
will have the tools to build DTD support from what they're given.  Of
course, "standard" DTD support can be added later, built on the current
StAX parser.

I would support it if you proposed dxml to be added to Phobos.

T

-- 
There are 10 kinds of people in the world: those who can count in binary, and those who can't.