High performance XML parser

Wed Feb 9 05:37:56 PST 2011

On Tue, 08 Feb 2011 19:16:37 -0500, Tomek Sowiński <just at ask.me> wrote:

> Steven Schveighoffer wrote:
>
>> > The design I'm thinking of is that the node iterator will own a buffer.
>> > One consequence is that the fields of the current node will point to the
>> > buffer akin to foreach(line; File.byLine), so in order to lift the input
>> > the user will have to dup (or process the node in-place). As new nodes
>> > will be overwritten on the same piece of memory, an important trait of
>> > the design emerges: cache intensity. Because of XML namespaces I think
>> > it is necessary for the buffer to contain the current node plus all its
>> > parents.
>>
>> That might not scale well.  For instance, if you are accessing the 1500th
>> child element of a parent, doesn't that mean that the buffer must contain
>> the full text for the previous 1499 elements in order to also contain the
>> parent?
>>
>> Maybe I'm misunderstanding what you mean.
>
> Let's work through an example:
>
> <a name="value">
> 	<b>
> 		Some Text 1
> 		<c2>      <!-- HERE -->
> 		Some text 2
> 		</c2>
> 		Some Text 3
> 	</b>
> </a>
>
> The buffer of the iterator positioned HERE would be:
>
> [Node a | Node b | Node c2]

OK, so you mean a buffer other than the I/O buffer.  That means
double-buffering the data.  I was thinking of a solution that parses
directly out of the I/O buffer, with no intermediate copy.  I think this is
one of the keys to Tango's XML performance.
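
To make the idea concrete, here is a minimal sketch of parsing straight out
of the I/O buffer.  The helper tagNames and the in-memory ioBuffer are made
up for illustration; this is not Tango's code, and a real parser would also
have to deal with refilling the buffer from the stream.

```d
import std.stdio : writeln;

/// Hypothetical helper: collects the names of opening tags in `buf`.
/// Every returned name is a slice of `buf`; no character data is copied.
const(char)[][] tagNames(const(char)[] buf)
{
    const(char)[][] names;
    size_t i = 0;
    while (i < buf.length)
    {
        // An opening tag starts with '<' not followed by '/', '!' or '?'.
        if (buf[i] == '<' && i + 1 < buf.length
            && buf[i + 1] != '/' && buf[i + 1] != '!' && buf[i + 1] != '?')
        {
            immutable start = i + 1;
            size_t end = start;
            // The name runs until whitespace, '/' or '>'.
            while (end < buf.length && buf[end] != '>' && buf[end] != '/'
                   && buf[end] != ' ' && buf[end] != '\t' && buf[end] != '\n')
                ++end;
            names ~= buf[start .. end];
            i = end;
        }
        else
            ++i;
    }
    return names;
}

void main()
{
    // Stand-in for the I/O buffer that a stream would refill (assumption).
    char[] ioBuffer =
        `<a name="value"><b>Some Text 1<c2>Some text 2</c2></b></a>`.dup;

    auto names = tagNames(ioBuffer);
    writeln(names);

    // The first name aliases the I/O buffer rather than a second buffer.
    assert(names[0].ptr == ioBuffer.ptr + 1);

    // As with File.byLine, a slice must be copied to survive a refill.
    auto kept = names[2].idup;
    ioBuffer[] = ' ';          // simulate the buffer being overwritten
    assert(kept == "c2");      // the copy is intact
    assert(names[2] == "  ");  // the slice now sees the new contents
}
```

The trade-off is the same as with byLine: nothing is copied while parsing,
but anything the user wants to keep past the next refill has to be dup'ed.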

-Steve