[due diligence] std.xml

Tue Oct 19 15:16:54 PDT 2010

On 2010-10-19 16:43:04 -0400, sybrandy <sybrandy at gmail.com> said:

> I guess one question we need to ask is what do we expect from this 
> library?  Do we want a full DOM implementation or is a SAX parser good 
> enough?  Or do we need something in between?  In PHP or Perl, perhaps 
> both, I saw a library where an XML document was essentially transformed 
> into nested associative arrays.  It made it very easy to read data from 
> the XML, however I don't know how much of the official standards it 
> complied with.

People have different needs for XML, so it's hard to come up with 
something that pleases everyone. I might have a solution to that, 
however: a template that makes it easy to implement any kind of parser.

I wrote two XML modules a little while ago. The first is a tokenizer 
template that can work as a pull parser, as a callback parser, or as 
a mix of both, and it is reentrant (you can invoke the tokenizer 
inside a callback to parse new tokens). The implementation was 
written from the XML spec, so I'm confident the parser is pretty much 
standard-compliant. As for gaps, the tokenizer lacks support for DTD 
internal subsets and user-defined character entities, and it leaves 
some well-formedness checks to the upper layers (such as checking 
that tag names match), where those checks should be less costly.
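
To make the pull/callback distinction concrete, here is a minimal 
sketch in Python. It is not the D tokenizer described above: the names 
`tokens` and `parse_with_callbacks` and the token tuples are 
hypothetical, and it only understands bare tags and text (no 
attributes, comments, CDATA, entities or DTDs). Like the real 
tokenizer, it does not check that tag names match.

```python
import re

# Hypothetical, deliberately tiny grammar: <name>, </name>, <name/>, or text.
_TOKEN = re.compile(r"<(/?)([A-Za-z_][\w.\-]*)\s*(/?)>|([^<]+)")

def tokens(text):
    """Pull interface: lazily yield (kind, value) tuples.

    kind is "open", "close" or "text". The caller decides when to ask
    for the next token, which is what makes this a pull parser.
    """
    pos = 0
    while pos < len(text):
        m = _TOKEN.match(text, pos)
        if m is None:
            raise ValueError("unsupported or malformed markup at offset %d" % pos)
        closing, name, empty, chars = m.groups()
        if chars is not None:
            yield ("text", chars)
        elif closing:
            yield ("close", name)
        else:
            yield ("open", name)
            if empty:
                # An empty-element tag is reported as open followed by close.
                yield ("close", name)
        pos = m.end()

def parse_with_callbacks(text, handlers):
    """Callback interface layered on the pull interface.

    handlers maps a token kind to a one-argument function; kinds with
    no handler are skipped.
    """
    for kind, value in tokens(text):
        handler = handlers.get(kind)
        if handler is not None:
            handler(value)
```

Pull-style use is `for kind, value in tokens("<a>hi<b/></a>")`, while 
callback-style use is `parse_with_callbacks(doc, {"open": on_open})`. 
Both run on the same token stream, which is the property that lets one 
tokenizer serve several kinds of parser.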

The second module is a basic tree model built on the tokenizer. It 
doesn't try to be DOM-conformant, but it shows how the tokenizer can be 
used and implements the higher-level well-formedness checks (matching 
tag names). Building a SAX parser on top of the tokenizer would be a 
piece of cake too.
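
As an illustration of that layering, here is a small Python sketch of 
a tree builder that consumes a token stream and performs the 
tag-matching check itself. It is an assumption-laden sketch, not the 
module described above: `Node`, `build_tree` and the `(kind, value)` 
token tuples are hypothetical names chosen for this example.

```python
class Node:
    """A minimal element node; not DOM-conformant."""
    def __init__(self, name):
        self.name = name
        self.children = []  # child Nodes and text strings, in document order

def build_tree(token_stream):
    """Build a tree from (kind, value) tokens, kind in open/close/text.

    The tokenizer is assumed not to verify nesting, so the matching of
    end tags against start tags is checked here, using a stack.
    """
    root = Node(None)  # synthetic root that holds the top-level content
    stack = [root]
    for kind, value in token_stream:
        if kind == "open":
            node = Node(value)
            stack[-1].children.append(node)
            stack.append(node)
        elif kind == "close":
            if len(stack) == 1:
                raise ValueError("unexpected end tag </%s>" % value)
            if stack[-1].name != value:
                raise ValueError("end tag </%s> does not match <%s>"
                                 % (value, stack[-1].name))
            stack.pop()
        elif kind == "text":
            stack[-1].children.append(value)
        else:
            raise ValueError("unknown token kind %r" % (kind,))
    if len(stack) != 1:
        raise ValueError("unclosed element <%s>" % stack[-1].name)
    return root
```

For example, `build_tree([("open", "a"), ("text", "hi"), ("close", "a")])` 
returns a root whose only child is an `a` node containing the text 
`hi`, while a stream that closes `b` inside an open `a` raises 
`ValueError`.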

It might be incomplete, but this code works: it's already in production 
in a small program (script?) of mine. I don't really have the time to 
work on it at the moment, but if anyone wants to take it and improve 
upon it, then it could probably become Phobos's XML parser. One thing 
that should be done is to make the tokenizer accept ranges, something 
I started a couple of months ago but never finished.

Here's the (slightly outdated) documentation. If someone wants to 
proceed, I'll extract the code from the rest of my code and release it 
under the Boost license.

http://michelf.com/docs/d/mfr/xmltok.html
http://michelf.com/docs/d/mfr/xml.html

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/