Status of std.xml (D2/Phobos)

Mon Jun 28 06:59:45 PDT 2010

On 28/06/2010 13:04, Steven Schveighoffer wrote:
> On Sun, 27 Jun 2010 14:56:21 -0400, Yao G. <nospamyao at gmail.com> wrote:
>
>> I did a simple implementation of a pull parser, using this API as
>> reference: http://xmlpull.org/
>>
>> But I used an iterator similar to the one used by Steve (from
>> dcollections) to parse the doc. It turns out that Tango did something
>> similar first (using iterator to parse the document), and seeing the
>> debacle caused by the Date module, I think it would be a bad idea to
>> release it.
>
> Did you look at Tango's code in question, or look at their
> documentation? If not, then you are fine.
>
> I think any implementation is going to have to at least try to use
> ranges or show why they are not a good idea for xml, since Andrei is set
> on using ranges for everything.
>
> BTW, I've not used std.xml or tango's xml, but I agree that an xml
> library is a very important part of today's standard libraries. Having
> xml in the standard allows for so much usage of it in many other places
> (serialization comes to mind immediately). If std.xml is bad (which I've
> heard from several independent people), then throw it out and make
> something new.
>
> I myself have tried to think of how xml can be done with ranges, but I
> believe one of the key elements is it has to parse xml without loading
> the entire document to be efficient enough for some applications. A DOM
> style parser which presents a range interface is probably fine, but a
> lazy interface would be the best. Since XML is a tree style, you need a
> range which allows moving down the tree. You almost need a stacking
> range which can move down the tree and also to the next sibling element.
> Ideally, the library should do as much as possible without allocating
> anything but buffer space to read data.
>
> -Steve

I've not looked at any of the D XML offerings (shame on me?) but I've
been having a bit of a look at the types of API that are available in
other languages, and there seem to be three:

Event based a la SAX

Stream based a la StAX

Tree based a la "the" DOM

The simple conclusion that I have drawn is that there is no
one-size-fits-all solution, and that it would therefore be a mistake to
put all effort into supporting only one. (However, ranges do seem to
match up quite nicely with the way that the Stream based APIs operate.)
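
As a rough illustration of that match, a stream-style parser can be
exposed as a D input range of events. This is only a sketch under heavy
assumptions: PullParser, Event, EventKind and startTagNames are
hypothetical names, not part of std.xml or Tango, and the tokenizer
handles only a tiny subset of XML (no attributes, comments, entities,
CDATA or error reporting).

```d
import std.algorithm : filter, map;
import std.array : array;
import std.string : indexOf;

enum EventKind { startTag, endTag, text }

struct Event
{
    EventKind kind;
    string value; // tag name or text content (a slice of the input, no copy)
}

/// Hypothetical pull parser over a tiny XML subset, exposed as an input range.
struct PullParser
{
    private string rest;   // unparsed remainder of the input
    private Event current; // event returned by front
    private bool done;

    this(string input)
    {
        rest = input;
        popFront(); // prime the first event
    }

    @property bool empty() { return done; }
    @property Event front() { return current; }

    void popFront()
    {
        if (rest.length == 0) { done = true; return; }

        if (rest[0] == '<')
        {
            auto close = rest.indexOf('>');
            if (close < 0) { done = true; return; } // malformed input: just stop
            auto tag = rest[1 .. close];
            rest = rest[close + 1 .. $];
            if (tag.length > 0 && tag[0] == '/')
                current = Event(EventKind.endTag, tag[1 .. $]);
            else
                current = Event(EventKind.startTag, tag);
        }
        else
        {
            // Character data runs up to the next '<' or the end of input.
            auto open = rest.indexOf('<');
            auto stop = open < 0 ? rest.length : open;
            current = Event(EventKind.text, rest[0 .. stop]);
            rest = rest[stop .. $];
        }
    }
}

/// Collects the names of all start tags, in document order.
string[] startTagNames(string xml)
{
    return PullParser(xml)
        .filter!(e => e.kind == EventKind.startTag)
        .map!(e => e.value)
        .array;
}
```

Because the parser is an ordinary input range, std.algorithm can consume
it lazily, one event at a time. For the input "<a><b>hi</b></a>",
startTagNames should return ["a", "b"], and every value is a slice of
the original string rather than a copy.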

It would seem to me most logical to consider the many varied use-cases 
and build a core API upon which all 3 types of XML processor can be 
built (or at least specify a core set of types to be used by all 3), 
rather than focus on implementing one particular style. Interoperability 
of all 3 styles would then be possible and perhaps facilitate the later 
implementation of higher abstractions (such as XPath and XQuery).
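
One possible shape for such a shared core, sketched under assumptions:
a single Event type that any event source produces, with an event-style
(SAX-like) dispatcher and a tree-style (DOM-like) builder both written
against it. All names here (Event, Handlers, dispatch, Node, buildTree,
countStartTags) are hypothetical and exist in no D library; validation,
attributes and namespaces are left out entirely.

```d
import std.range.primitives : ElementType, isInputRange;

enum EventKind { startTag, endTag, text }

/// The shared core type: every processing style consumes these.
struct Event
{
    EventKind kind;
    string value; // tag name or text content
}

/// Event style: callbacks that the driver pushes events into.
struct Handlers
{
    void delegate(string name) onStart;
    void delegate(string name) onEnd;
    void delegate(string text) onText;
}

/// Walks any input range of Event and calls the handlers that are set.
void dispatch(R)(R events, Handlers h)
    if (isInputRange!R && is(ElementType!R : Event))
{
    foreach (e; events)
    {
        final switch (e.kind)
        {
            case EventKind.startTag:
                if (h.onStart !is null) h.onStart(e.value);
                break;
            case EventKind.endTag:
                if (h.onEnd !is null) h.onEnd(e.value);
                break;
            case EventKind.text:
                if (h.onText !is null) h.onText(e.value);
                break;
        }
    }
}

/// Tree style: a minimal node type built from the same events.
class Node
{
    string name;     // element name; null for text nodes
    string text;     // character data; null for element nodes
    Node[] children;

    this(string name, string text = null)
    {
        this.name = name;
        this.text = text;
    }
}

/// Builds a tree under a synthetic "#document" root from the event stream.
Node buildTree(R)(R events)
    if (isInputRange!R && is(ElementType!R : Event))
{
    auto root = new Node("#document");
    Node[] stack = [root]; // currently open elements, innermost last
    foreach (e; events)
    {
        final switch (e.kind)
        {
            case EventKind.startTag:
            {
                auto n = new Node(e.value);
                stack[$ - 1].children ~= n;
                stack ~= n;
                break;
            }
            case EventKind.endTag:
                if (stack.length > 1) stack = stack[0 .. $ - 1];
                break;
            case EventKind.text:
                stack[$ - 1].children ~= new Node(null, e.value);
                break;
        }
    }
    return root;
}

/// Example of the event style: counts start tags through a callback.
size_t countStartTags(Event[] events)
{
    size_t n;
    Handlers h;
    h.onStart = (string name) { ++n; };
    dispatch(events, h);
    return n;
}
```

The stream style is the event range itself, so all three styles share
one vocabulary of types, and something like XPath could in principle be
layered over either the range or the tree.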

I think it is also important to remember that there are at least 4 
different stages to processing XML (reading, validating, mutating, 
writing) and that many programming tasks allow one or more of these 
aspects to be ignored. This can mean that one programmer is blinded to 
the requirements of another in a different domain because the ways in 
which they work with XML either overlap only partially or not at all.

I've never used anything like SAX myself, though I have used the DOM 
quite a lot, and spent most of the time wishing it worked a bit more 
like StAX (even though I hadn't heard of StAX at the time ^^).

Whatever is done for D, it should allow programmers to work with XML in
a way that is familiar to them and compatible with what others do.
Memory should be used conservatively, and reprocessing (parsing the same
portion of a document multiple times) should be minimised.

Most importantly, the implementation should be D-ey, rather than the
abstraction used in any other language's most favoured solution,
shoehorned into a D-shaped box.

A...
(whose 2 cents are worth no more or no less than anyone else's.)