Status of std.xml (D2/Phobos)

Tue Jun 29 05:27:08 PDT 2010

On 2010-06-29 04:41:50 -0400, Alix Pexton <alix.DOT.pexton at gmail.DOT.com> said:

> On 28/06/2010 15:11, Steven Schveighoffer wrote:
> 
>> Yes, I don't think the phobos solution needs to mimic exactly the API of
>> SAX or DOM, the author should be free to use D idioms. But starting with
>> a common proven design is probably a good idea.
>> 
>> -Steve
> 
> I've been thinking about it, and while I believe you when you say that 
> SAX can be used to build the DOM, I'm not convinced that SAX is the 
> lowest common abstraction.
> 
> Michel Fortin's Tokenizer/Range seems much closer to the metal to me.

It is closer to the metal, but there's a catch...

One issue with SAX is that you must build an array of strings to 
pass an element's attributes to the callback, which will probably 
require a dynamic allocation at some point. A lower-level abstraction 
such as mine (or Tango's pull parser) just returns each attribute as a 
separate token as it parses it.
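
To make the "one token per attribute" idea concrete, here is a rough
sketch of what such a pull interface could look like. It is written in
Python for brevity, not D, and it is not the actual tokenizer: the
`tokenize` function, the `(kind, name, value)` token shape, and the
regular expressions are all assumptions made for illustration. It only
handles a small subset of XML (no comments, CDATA, entities, or
namespaces, and no '>' inside attribute values).

```python
import re
from typing import Iterator, Tuple

# Hypothetical token shape: (kind, name, value).
Token = Tuple[str, str, str]

# Simplified patterns; assumption: attribute values never contain '<' or '>'.
_TAG = re.compile(r"<(/?)([A-Za-z_][\w.\-]*)([^<>]*?)(/?)>")
_ATTR = re.compile(r"""([A-Za-z_][\w.\-]*)\s*=\s*("([^"]*)"|'([^']*)')""")


def tokenize(xml: str) -> Iterator[Token]:
    """Yield tokens lazily; each attribute is its own token."""
    pos = 0
    for tag in _TAG.finditer(xml):
        # Non-whitespace character data before this tag becomes a text token.
        text = xml[pos:tag.start()]
        if text.strip():
            yield ("text", "", text)
        closing, name, attrs, self_closing = tag.groups()
        if closing:
            yield ("end", name, "")
        else:
            yield ("start", name, "")
            # No attribute list is built: attributes are handed out one by one.
            for attr in _ATTR.finditer(attrs):
                double_quoted, single_quoted = attr.group(3), attr.group(4)
                value = double_quoted if double_quoted is not None else single_quoted
                yield ("attr", attr.group(1), value)
            if self_closing:
                # Report <b/> as a start token followed by an end token.
                yield ("end", name, "")
        pos = tag.end()
    tail = xml[pos:]
    if tail.strip():
        yield ("text", "", tail)
```

Note that nothing here checks tag balancing or duplicate attributes:
`tokenize('<a x="1" x="2"></b>')` happily yields every token. Python
itself allocates a tuple per token, so the sketch shows the shape of
the interface, not the allocation savings you would get in D.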

The downside of the tokenizer interface is that it only checks a 
subset of the well-formedness constraints. For instance, it doesn't 
check that tags balance correctly or that no two attributes of an 
element share the same name. It's just a "tokenizer" after all; it 
can't be described as a conformant XML parser by itself. The 
upper-layer parser needs to check for these things. My mini DOM built 
on this tokenizer does these checks as it consumes the tokens, and it's 
more efficient to do them there because that's where the context 
information is kept, which is why the tokenizer doesn't do them.

Implementing SAX on top of my tokenizer consists mostly of ensuring 
proper tag balancing, checking for duplicate attributes, and collecting 
attributes in an array (or another kind of list) you can then give to 
the openElement SAX callback.
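
As a rough illustration of those three jobs, here is a minimal sketch
of a SAX-style driver sitting on top of a token stream. It is Python
rather than D, and everything in it is hypothetical: the
`(kind, name, value)` token tuples, the `drive_sax` function, the
`NotWellFormed` exception, and the handler method names
(`open_element`, `close_element`, `text`) are assumptions for the
example, not the real tokenizer or any existing SAX API.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical token shape: (kind, name, value), kind in
# {"start", "attr", "end", "text"}.
Token = Tuple[str, str, str]


class NotWellFormed(Exception):
    """Raised for the checks the tokenizer leaves to the upper layer."""


class RecordingHandler:
    """Toy SAX-style handler that just records the callbacks it receives."""

    def __init__(self) -> None:
        self.events: List[tuple] = []

    def open_element(self, name: str, attrs: Dict[str, str]) -> None:
        self.events.append(("open", name, dict(attrs)))

    def close_element(self, name: str) -> None:
        self.events.append(("close", name))

    def text(self, data: str) -> None:
        self.events.append(("text", data))


def drive_sax(tokens: Iterable[Token], handler) -> None:
    open_tags: List[str] = []  # context needed for tag balancing
    # Start tag whose attributes are still arriving, with the attributes so far.
    pending: Optional[Tuple[str, Dict[str, str]]] = None

    def flush() -> None:
        # All attributes have been seen: fire the open callback once.
        nonlocal pending
        if pending is not None:
            handler.open_element(pending[0], pending[1])
            pending = None

    for kind, name, value in tokens:
        if kind == "attr":
            if pending is None:
                raise NotWellFormed("attribute outside a start tag")
            if name in pending[1]:
                raise NotWellFormed(f"duplicate attribute {name!r}")
            pending[1][name] = value  # this is where the allocation happens
            continue
        flush()
        if kind == "start":
            open_tags.append(name)
            pending = (name, {})
        elif kind == "end":
            if not open_tags or open_tags[-1] != name:
                raise NotWellFormed(f"unexpected closing tag {name!r}")
            open_tags.pop()
            handler.close_element(name)
        elif kind == "text":
            handler.text(value)
        else:
            raise ValueError(f"unknown token kind {kind!r}")
    flush()
    if open_tags:
        raise NotWellFormed(f"unclosed element {open_tags[-1]!r}")
```

The attribute collection and the open-element stack live only in this
layer, so a consumer that doesn't need them (a DOM builder with its own
node storage, say) can read the tokens directly and skip that cost.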

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/