High performance XML parser
Robert Jacques
sandford at jhu.edu
Mon Feb 7 07:37:46 PST 2011
On Mon, 07 Feb 2011 07:40:30 -0500, Steven Schveighoffer
<schveiguy at yahoo.com> wrote:
> On Fri, 04 Feb 2011 17:36:50 -0500, Tomek Sowiński <just at ask.me> wrote:
>
>> Steven Schveighoffer napisał:
>>
>>> Here is how I would approach it (without doing any research).
>>>
>>> First, we need a buffered I/O system where you can easily access and
>>> manipulate the buffer. I have proposed one a few months ago in this
>>> NG.
>>>
>>> Second, I'd implement the XML lib as a range where "front()" gives you
>>> an
>>> XMLNode. If the XMLNode is an element, it will have eager access to
>>> the
>>> element tag, and lazy access to the attributes and the sub-nodes. Each
>>> XMLNode will provide a forward range for the child nodes.
>>>
>>> Thus you can "skip" whole elements in the stream by popFront'ing a
>>> range,
>>> and dive deeper via accessing the nodes of the range.
>>>
>>> I'm unsure how well this will work, or if you can accomplish all of it
>>> without reallocation (in particular, you may need to store the element
>>> information, maybe via a specialized member function?).
>>
>> Heh, yesterday when I couldn't sleep I was sketching the design. I
>> converged to a pretty much same concept, so your comment is reassuring
>> :).
>>
>> The design I'm thinking is that the node iterator will own a buffer.
>> One consequence is that the fields of the current node will point to
>> the buffer akin to foreach(line; File.byLine), so in order to lift the
>> input the user will have to dup (or process the node in-place). As new
>> nodes will be overwritten on the same piece of memory, an important
>> trait of the design emerges: cache intensity. Because of XML namespaces
>> I think it is necessary for the buffer to contain the current node plus
>> all its parents.
>
> That might not scale well. For instance, if you are accessing the
> 1500th child element of a parent, doesn't that mean that the buffer must
> contain the full text for the previous 1499 elements in order to also
> contain the parent?
>
> Maybe I'm misunderstanding what you mean.
>
> I would start out with a non-compliant parser, but one that allocates
> nothing beyond the I/O buffer, one that simply parses lazily and can be
> used as well as a SAX parser. Then see how much extra allocations we
> need to get it to be compliant. Then, one can choose the compliancy
> level based on what performance penalties one is willing to incur.
>
> -Steve
I would consider a tokenizer which can be used for SAX style parsing to be
a key feature of std.xml. I know it was considered very important when I
was gathering requirements for my std.JSON re-write.
More information about the Digitalmars-d
mailing list