High performance XML parser

Fri Feb 4 13:21:26 PST 2011

On Fri, 04 Feb 2011 16:02:39 -0500, Tomek Sowiński <just at ask.me> wrote:

> I am now intensely accumulating information on how to go about creating  
> a high-performance parser as it quickly became clear that my old one  
> won't deliver. And if anything is clear is that memory is the key.
>
> One way is the slicing approach mentioned on this NG, notably used by  
> RapidXML. I already contacted Marcin (the author) to ensure that using  
> solutions inspired by his lib is OK with him; it is. But I don't think  
> I'll go this way. One reason is, surprisingly, performance. RapidXML  
> cannot start parsing until the entire document is loaded and ready as a  
> random-access string. Then it's blazingly fast but the time for I/O has  
> already elapsed. Besides, as Marcin himself said, we need a 100%  
> W3C-compliant implementation and RapidXML isn't one.
>
> I think a much more fertile approach is to operate on a forward range,  
> perhaps assuming bufferized input. That way I can start parsing as soon  
> as the first buffer gets filled. Not to mention that the end result will  
> use much less memory. Plenty of the XML data stream is indents, spaces,  
> and markup -- there's no reason to copy all this into memory.
>
> To sum up, I belive memory and overlapping I/O latencies with parsing  
> effort are pivotal.
>
> Please comment on this.

Here is how I would approach it (without doing any research).

First, we need a buffered I/O system where you can easily access and  
manipulate the buffer.  I have proposed one a few months ago in this NG.

Second, I'd implement the XML lib as a range where "front()" gives you an  
XMLNode.  If the XMLNode is an element, it will have eager access to the  
element tag, and lazy access to the attributes and the sub-nodes.  Each  
XMLNode will provide a forward range for the child nodes.

Thus you can "skip" whole elements in the stream by popFront'ing a range,  
and dive deeper via accessing the nodes of the range.

I'm unsure how well this will work, or if you can accomplish all of it  
without reallocation (in particular, you may need to store the element  
information, maybe via a specialized member function?).

-Steve