High performance XML parser

Tue Feb 8 16:16:37 PST 2011

Steven Schveighoffer napisał:

> > The design I'm thinking is that the node iterator will own a buffer. One  
> > consequence is that the fields of the current node will point to the  
> > buffer akin to foreach(line; File.byLine), so in order to lift the input  
> > the user will have to dup (or process the node in-place). As new nodes  
> > will be overwritten on the same piece of memory, an important trait of  
> > the design emerges: cache intensity. Because of XML namespaces I think  
> > it is necessary for the buffer to contain the current node plus all its  
> > parents.  
> 
> That might not scale well.  For instance, if you are accessing the 1500th  
> child element of a parent, doesn't that mean that the buffer must contain  
> the full text for the previous 1499 elements in order to also contain the  
> parent?
> 
> Maybe I'm misunderstanding what you mean.

Let's talk on an example:

<a name="value">
	<b>
		Some Text 1
		<c2>      <!-- HERE -->
		Some text 2
		</c2>
		Some Text 3
	</b>
</a>

The buffer of the iterator positioned HERE would be:

[Node a | Node b | Node c2]

Node c2 and all its parents are available for inspection. Node a's attribute is stored in the buffer, but not b's "Some text 1" as it is c2's sibling; "Some text 1" was available in the previous iteration, now it's overwritten by c2. To get to "Some text 2" let's advance the iterator in depth to get:

[Node a | Node b | Node c2 | Text node "Some text 2"]

Advancing it once more we get to:

[Node a | Node b | Text node "Some text 3"]

So "Some text 3" is written where c2 and the text node 2 used to be.

The element type of the range would always be the child, parents available through pointers:

foreach (node; xmlRange) {
	doStuff(node);
	if (Node* parent = node.parent)
		doOtherStuff(parent);
}

Having no access to siblings is quite limiting but the iterator can form an efficient (zero-allocation) basis on which more convenient schemes are built upon. It's still just brain-storming, though. I fear there's something that'll make the whole idea crash & burn.

> I would start out with a non-compliant parser, but one that allocates  
> nothing beyond the I/O buffer, one that simply parses lazily and can be  
> used as well as a SAX parser.  Then see how much extra allocations we need  
> to get it to be compliant.  Then, one can choose the compliancy level  
> based on what performance penalties one is willing to incur.

Yeah, 100% compliance is a long way.

-- 
Tomek