High performance XML parser

Steven Schveighoffer schveiguy at yahoo.com
Mon Feb 7 04:36:17 PST 2011


On Fri, 04 Feb 2011 17:03:08 -0500, Simen kjaeraas  
<simen.kjaras at gmail.com> wrote:

> Steven Schveighoffer <schveiguy at yahoo.com> wrote:
>
>> Here is how I would approach it (without doing any research).
>>
>> First, we need a buffered I/O system where you can easily access and  
>> manipulate the buffer.  I have proposed one a few months ago in this NG.
>>
>> Second, I'd implement the XML lib as a range where "front()" gives you  
>> an XMLNode.  If the XMLNode is an element, it will have eager access to  
>> the element tag, and lazy access to the attributes and the sub-nodes.   
>> Each XMLNode will provide a forward range for the child nodes.
>>
>> Thus you can "skip" whole elements in the stream by popFront'ing a  
>> range, and dive deeper via accessing the nodes of the range.
>>
>> I'm unsure how well this will work, or if you can accomplish all of it  
>> without reallocation (in particular, you may need to store the element  
>> information, maybe via a specialized member function?).
>
> Question:
>
> For the lazily computed attributes and subnodes, will accessing one
> element cause all elements to be computed? Same goes for getting the
> number of elements.

The goal is to avoid double-buffering data, so the input stream's buffer  
holds all of the data.  Advancing to the 'next' element/node/attribute  
therefore makes the previous element/node/attribute invalid (i.e. the  
buffer is reused).

The trick is to make it seem like the node is fully there without actually  
reading the stream until you need it (hence the lazy part), because  
reading the entire node means reading the entire file (in the case of the  
root element).
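
To make that concrete, here is a rough sketch of the idea in Python rather
than D.  Every name in it (Reader, next_tag, children) is made up for
illustration and is not part of any existing library.  It assumes a very
restricted input: only <name> and </name> tags, with no attributes,
comments, or self-closing elements.  For simplicity it copies each tag
name instead of handing out slices of a shared buffer, so it shows the
lazy and forward-only behaviour but not the zero-copy part.

```python
import io


class Reader:
    """Hypothetical forward-only tag reader over a byte stream."""

    def __init__(self, stream):
        self.stream = stream
        self.depth = 0      # number of elements currently open
        self.consumed = 0   # bytes pulled from the stream so far

    def _byte(self):
        c = self.stream.read(1)
        self.consumed += len(c)
        return c

    def next_tag(self):
        """Return ('open' | 'close', name), or None at end of input."""
        while True:                      # skip text up to the next '<'
            c = self._byte()
            if not c:
                return None
            if c == b"<":
                break
        name = bytearray()
        while True:                      # collect everything up to '>'
            c = self._byte()
            if not c:
                raise ValueError("unterminated tag")
            if c == b">":
                break
            name += c
        if name.startswith(b"/"):
            self.depth -= 1
            return ("close", bytes(name[1:]))
        self.depth += 1
        return ("open", bytes(name))


def children(reader):
    """Lazily yield the tag name of each element at the current level.

    Asking for the next sibling first skips whatever the caller left
    unread of the previous one, so the previous node cannot be revisited.
    """
    level = reader.depth
    while True:
        while reader.depth > level:      # skip the rest of the last child
            if reader.next_tag() is None:
                return
        tok = reader.next_tag()
        if tok is None or tok[0] == "close":
            return
        yield tok[1]


reader = Reader(io.BytesIO(b"<root><a><x></x></a><b></b></root>"))
top = children(reader)
root = next(top)
print(root, reader.consumed)             # only the root's own tag was read
print(list(children(reader)))            # siblings under root; <x> is skipped
```

Getting the root's tag name touches only the first few bytes of the
input.  Iterating the root's children without diving into <a> skips over
<x>, and calling children(reader) while positioned on <a> would descend
into it instead.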

> Also, can this be efficiently combined with mmapping? What I sorta
> imagine is a kind of lazy slice: It determines whether it ends within
> this page, and if not, does not progress past that page until asked to
> do so.

mmapping would make things more accessible, but the common denominator is  
not mmap.  If it's supported as a special case, then maybe it can offer  
some interesting features, but something like mmap can't be done for, say,  
a network stream.
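
A small sketch of what "common denominator" means here, again in Python
and with made-up names (count_tags, demo).  The consumer only ever asks
for the next chunk, which any stream can do, so the same function runs
unchanged over an in-memory stream (standing in for a network socket)
and over a memory-mapped file.  Random-access slicing is the extra that
only the mapping offers.

```python
import io
import mmap
import tempfile


def count_tags(source, chunk_size=8):
    """Count '<' characters using nothing but sequential read(n) calls."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            return total
        total += chunk.count(b"<")


def demo():
    doc = b"<root><a></a><b></b></root>"

    # Sequential source: could just as well be a socket's file object.
    via_stream = count_tags(io.BytesIO(doc))

    # Memory-mapped source: same sequential interface, plus slicing.
    with tempfile.TemporaryFile() as f:
        f.write(doc)
        f.flush()
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            via_mmap = count_tags(m)
            first_tag = bytes(m[0:6])    # random access, mmap only

    return via_stream, via_mmap, first_tag


print(demo())
```

A parser written against the sequential interface works for both; one
written against slicing would work only for the mapped file.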

-Steve

