High performance XML parser

Mon Feb 7 07:37:46 PST 2011

On Mon, 07 Feb 2011 07:40:30 -0500, Steven Schveighoffer  
<schveiguy at yahoo.com> wrote:

> On Fri, 04 Feb 2011 17:36:50 -0500, Tomek Sowiński <just at ask.me> wrote:
>
>> Steven Schveighoffer napisał:
>>
>>> Here is how I would approach it (without doing any research).
>>>
>>> First, we need a buffered I/O system where you can easily access and
>>> manipulate the buffer.  I have proposed one a few months ago in this  
>>> NG.
>>>
>>> Second, I'd implement the XML lib as a range where "front()" gives you  
>>> an
>>> XMLNode.  If the XMLNode is an element, it will have eager access to  
>>> the
>>> element tag, and lazy access to the attributes and the sub-nodes.  Each
>>> XMLNode will provide a forward range for the child nodes.
>>>
>>> Thus you can "skip" whole elements in the stream by popFront'ing a  
>>> range,
>>> and dive deeper via accessing the nodes of the range.
>>>
>>> I'm unsure how well this will work, or if you can accomplish all of it
>>> without reallocation (in particular, you may need to store the element
>>> information, maybe via a specialized member function?).
>>
>> Heh, yesterday when I couldn't sleep I was sketching the design. I  
>> converged to a pretty much same concept, so your comment is reassuring  
>> :).
>>
>> The design I'm thinking is that the node iterator will own a buffer.  
>> One consequence is that the fields of the current node will point to  
>> the buffer akin to foreach(line; File.byLine), so in order to lift the  
>> input the user will have to dup (or process the node in-place). As new  
>> nodes will be overwritten on the same piece of memory, an important  
>> trait of the design emerges: cache intensity. Because of XML namespaces  
>> I think it is necessary for the buffer to contain the current node plus  
>> all its parents.
>
> That might not scale well.  For instance, if you are accessing the  
> 1500th child element of a parent, doesn't that mean that the buffer must  
> contain the full text for the previous 1499 elements in order to also  
> contain the parent?
>
> Maybe I'm misunderstanding what you mean.
>
> I would start out with a non-compliant parser, but one that allocates  
> nothing beyond the I/O buffer, one that simply parses lazily and can be  
> used as well as a SAX parser.  Then see how much extra allocations we  
> need to get it to be compliant.  Then, one can choose the compliancy  
> level based on what performance penalties one is willing to incur.
>
> -Steve

I would consider a tokenizer which can be used for SAX style parsing to be  
a key feature of std.xml. I know it was considered very important when I  
was gathering requirements for my std.JSON re-write.