High performance XML parser

Mon Feb 7 04:40:30 PST 2011

On Fri, 04 Feb 2011 17:36:50 -0500, Tomek Sowiński <just at ask.me> wrote:

> Steven Schveighoffer napisał:
>
>> Here is how I would approach it (without doing any research).
>>
>> First, we need a buffered I/O system where you can easily access and
>> manipulate the buffer.  I have proposed one a few months ago in this NG.
>>
>> Second, I'd implement the XML lib as a range where "front()" gives you  
>> an
>> XMLNode.  If the XMLNode is an element, it will have eager access to the
>> element tag, and lazy access to the attributes and the sub-nodes.  Each
>> XMLNode will provide a forward range for the child nodes.
>>
>> Thus you can "skip" whole elements in the stream by popFront'ing a  
>> range,
>> and dive deeper via accessing the nodes of the range.
>>
>> I'm unsure how well this will work, or if you can accomplish all of it
>> without reallocation (in particular, you may need to store the element
>> information, maybe via a specialized member function?).
>
> Heh, yesterday when I couldn't sleep I was sketching the design. I  
> converged to a pretty much same concept, so your comment is reassuring  
> :).
>
> The design I'm thinking is that the node iterator will own a buffer. One  
> consequence is that the fields of the current node will point to the  
> buffer akin to foreach(line; File.byLine), so in order to lift the input  
> the user will have to dup (or process the node in-place). As new nodes  
> will be overwritten on the same piece of memory, an important trait of  
> the design emerges: cache intensity. Because of XML namespaces I think  
> it is necessary for the buffer to contain the current node plus all its  
> parents.

That might not scale well.  For instance, if you are accessing the 1500th  
child element of a parent, doesn't that mean that the buffer must contain  
the full text for the previous 1499 elements in order to also contain the  
parent?

Maybe I'm misunderstanding what you mean.

I would start out with a non-compliant parser, but one that allocates  
nothing beyond the I/O buffer, one that simply parses lazily and can be  
used as well as a SAX parser.  Then see how much extra allocations we need  
to get it to be compliant.  Then, one can choose the compliancy level  
based on what performance penalties one is willing to incur.

-Steve