High performance XML parser

Tomek Sowiński just at ask.me
Fri Feb 4 14:36:50 PST 2011


Steven Schveighoffer napisał:

> Here is how I would approach it (without doing any research).
> 
> First, we need a buffered I/O system where you can easily access and  
> manipulate the buffer.  I have proposed one a few months ago in this NG.
> 
> Second, I'd implement the XML lib as a range where "front()" gives you an  
> XMLNode.  If the XMLNode is an element, it will have eager access to the  
> element tag, and lazy access to the attributes and the sub-nodes.  Each  
> XMLNode will provide a forward range for the child nodes.
> 
> Thus you can "skip" whole elements in the stream by popFront'ing a range,  
> and dive deeper via accessing the nodes of the range.
> 
> I'm unsure how well this will work, or if you can accomplish all of it  
> without reallocation (in particular, you may need to store the element  
> information, maybe via a specialized member function?).

Heh, yesterday when I couldn't sleep I was sketching the design. I converged to a pretty much same concept, so your comment is reassuring :).

The design I'm thinking is that the node iterator will own a buffer. One consequence is that the fields of the current node will point to the buffer akin to foreach(line; File.byLine), so in order to lift the input the user will have to dup (or process the node in-place). As new nodes will be overwritten on the same piece of memory, an important trait of the design emerges: cache intensity. Because of XML namespaces I think it is necessary for the buffer to contain the current node plus all its parents. Namespaces are the technical reason but having access to the path all the way to the root node is of value, regardless. This suggests mark-release memory management. The buffer will have to be long enough to fit the deepest tag sequence: theoretically infinite, not that much in practice. Like I said, the buffer will be owned by the iterator so probably deterministic deallocation is possible when the processing is done.

One drawback is that you won't know you're dealing with a well-formed DOM until the closing tag comes. If it doesn't, it'll of course throw, but the malformed DOM may already have been digested. So providing some rollback possibility is up to the user.

-- 
Tomek



More information about the Digitalmars-d mailing list