High performance XML parser

Michael Rynn michaelrynn at optusnet.com.au
Fri Feb 11 05:12:38 PST 2011


On Mon, 07 Feb 2011 10:37:46 -0500, Robert Jacques wrote:

> On Mon, 07 Feb 2011 07:40:30 -0500, Steven Schveighoffer
> <schveiguy at yahoo.com> wrote:
> 
>> On Fri, 04 Feb 2011 17:36:50 -0500, Tomek Sowiński <just at ask.me> wrote:
>>
>>> Steven Schveighoffer wrote:
>>>
>>>> Here is how I would approach it (without doing any research).
>>>>
>>>> First, we need a buffered I/O system where you can easily access and
>>>> manipulate the buffer.  I have proposed one a few months ago in this
>>>> NG.
>>>>
>>>> Second, I'd implement the XML lib as a range where "front()" gives
>>>> you an
>>>> XMLNode.  If the XMLNode is an element, it will have eager access to
>>>> the
>>>> element tag, and lazy access to the attributes and the sub-nodes. 
>>>> Each XMLNode will provide a forward range for the child nodes.
>>>>
>>>> Thus you can "skip" whole elements in the stream by popFront'ing a
>>>> range,
>>>> and dive deeper via accessing the nodes of the range.
>>>>

> 
> I would consider a tokenizer which can be used for SAX style parsing to
> be a key feature of std.xml. I know it was considered very important
> when I was gathering requirements for my std.JSON re-write.

XML parser
/dsource/xmlp/trunk/std

I have experimented with various means to balance efficiency and 
flexibility with XML parsing.

The core parsing uses an ItemReturn struct. It returns transient 
slices into reused buffers, which cuts down on memory reallocation 
while churning through the XML document.

struct ItemReturn {
	ItemType		type = ItemType.RET_NULL;	// kind of token returned
	char[]			scratch;	// transient buffer: tag name or content
	char[][]		names;		// attribute names (transient)
	char[][]		values;		// attribute values (transient)
}

The central parse method fills the ItemReturn with transient tag names 
and slices to attribute names and values, somewhat like a SAX parser.
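
As a sketch of the calling pattern, here is a minimal pull loop over 
ItemReturn tokens. The ToyParser type, its parse signature and the 
single TAG_START item type are invented for illustration (the real 
ItemType and parse API differ); the point is that scratch is a 
transient slice which must be copied with .idup before the next parse 
call reuses the buffer.

```d
import std.stdio;

enum ItemType { RET_NULL, TAG_START }   // illustrative subset only

struct ItemReturn {
    ItemType type = ItemType.RET_NULL;
    char[]   scratch;   // transient: valid only until the next parse call
    char[][] names;
    char[][] values;
}

// Toy stand-in for the real parser: every token is returned as a
// slice into one reused buffer, so callers must copy what they keep.
struct ToyParser {
    string[] tokens;
    char[] buf;         // the single reused scratch buffer

    bool parse(ref ItemReturn item) {
        if (tokens.length == 0)
            return false;
        auto t = tokens[0];
        tokens = tokens[1 .. $];
        buf.length = t.length;
        buf[] = t[];                    // overwrite the shared buffer
        item.type = ItemType.TAG_START;
        item.scratch = buf;             // transient slice, not a copy
        return true;
    }
}

void main() {
    auto p = ToyParser(["book", "title"]);
    ItemReturn item;
    string[] kept;
    while (p.parse(item))
        kept ~= item.scratch.idup;      // persist before buffer reuse
    writeln(kept);                      // ["book", "title"]
}
```

Skipping a subtree is then just a matter of consuming tokens without 
copying anything.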

To measure the performance of each component, take the throughput of 
building an XML document as a linked DOM tree structure, with 
validation and attribute normalisation, as the 100% baseline.

With this implementation, using buffers for file or string sources, I 
get the following breakdown of processing time, measured on books.xml.  
Other documents, with different structure, entities, DTDs, schemas and 
namespaces, will differ.

Input overhead.
Examining every single Unicode character in the document as a dchar 
value takes about 12-15% of the time, so the input buffering is not a 
relatively great cost.
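
That per-character cost can be pictured as the walk below, where D's 
foreach over a string decodes UTF-8 into dchar values on the fly (the 
function name is mine, not the library's):

```d
import std.stdio;

// The same per-character walk the input layer performs:
// each loop step decodes one UTF-8 sequence into one dchar.
size_t countCodePoints(string s) {
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main() {
    // "é" is two UTF-8 bytes but a single code point.
    writeln(countCodePoints("<a>héllo</a>"));   // 12
}
```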

Tag, attribute and content throughput.
Parsing and filling the ItemReturn struct on each parse call, once for 
every identifiable XML token in the document sequence, takes about 60% 
of the time without doing anything with the result. This figure 
includes the input overhead.  No DOM structure is assumed, and there is 
no recursion.  The work done is mostly keeping track of state, and 
assembling and returning the various types of tokens.

Actually building a full DOM, without doing much in the way of 
validation and attribute normalisation, takes up to about 86% of the 
total time.
This includes converting the returned transient buffers for tags, 
attribute names, values and content into strings, and the creation and 
linking of DOM nodes.   This involves some recursive method calls that 
mirror the XML document structure. 
Some time and memory is saved by aliasing tag and attribute names 
through an AA; with that, the full DOM build takes about 85% of the job.
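
The name aliasing can be sketched with an AA that maps each distinct 
name to a single immutable string, so repeated tags and attributes cost 
one allocation in total; the intern helper here is hypothetical, not 
the library's API:

```d
import std.stdio;

// Return a single shared immutable copy of a transient name,
// allocating only the first time each distinct name is seen.
string intern(ref string[const(char)[]] pool, const(char)[] transient) {
    if (auto p = transient in pool)
        return *p;                  // seen before: reuse that string
    auto s = transient.idup;        // one allocation per distinct name
    pool[s] = s;
    return s;
}

void main() {
    string[const(char)[]] pool;
    char[] buf = "title".dup;       // stands in for a reused parse buffer
    auto a = intern(pool, buf);
    auto b = intern(pool, buf);     // no new copy on the second call
    writeln(a is b);                // true: both alias one string
    writeln(pool.length);           // 1
}
```

Since tag and attribute names repeat heavily in most documents, the AA 
lookup is cheaper than allocating a fresh string per occurrence.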
 

Additional validation and attribute normalisation takes more time.


More information about the Digitalmars-d mailing list