High performance XML parser
Tomek Sowiński
just at ask.me
Tue Feb 8 16:16:37 PST 2011
Steven Schveighoffer napisał:
> > The design I'm thinking is that the node iterator will own a buffer. One
> > consequence is that the fields of the current node will point to the
> > buffer akin to foreach(line; File.byLine), so in order to lift the input
> > the user will have to dup (or process the node in-place). As new nodes
> > will be overwritten on the same piece of memory, an important trait of
> > the design emerges: cache intensity. Because of XML namespaces I think
> > it is necessary for the buffer to contain the current node plus all its
> > parents.
>
> That might not scale well. For instance, if you are accessing the 1500th
> child element of a parent, doesn't that mean that the buffer must contain
> the full text for the previous 1499 elements in order to also contain the
> parent?
>
> Maybe I'm misunderstanding what you mean.
Let's talk on an example:
<a name="value">
<b>
Some Text 1
<c2> <!-- HERE -->
Some text 2
</c2>
Some Text 3
</b>
</a>
The buffer of the iterator positioned HERE would be:
[Node a | Node b | Node c2]
Node c2 and all its parents are available for inspection. Node a's attribute is stored in the buffer, but not b's "Some text 1" as it is c2's sibling; "Some text 1" was available in the previous iteration, now it's overwritten by c2. To get to "Some text 2" let's advance the iterator in depth to get:
[Node a | Node b | Node c2 | Text node "Some text 2"]
Advancing it once more we get to:
[Node a | Node b | Text node "Some text 3"]
So "Some text 3" is written where c2 and the text node 2 used to be.
The element type of the range would always be the child, parents available through pointers:
foreach (node; xmlRange) {
doStuff(node);
if (Node* parent = node.parent)
doOtherStuff(parent);
}
Having no access to siblings is quite limiting but the iterator can form an efficient (zero-allocation) basis on which more convenient schemes are built upon. It's still just brain-storming, though. I fear there's something that'll make the whole idea crash & burn.
> I would start out with a non-compliant parser, but one that allocates
> nothing beyond the I/O buffer, one that simply parses lazily and can be
> used as well as a SAX parser. Then see how much extra allocations we need
> to get it to be compliant. Then, one can choose the compliancy level
> based on what performance penalties one is willing to incur.
Yeah, 100% compliance is a long way.
--
Tomek
More information about the Digitalmars-d
mailing list