High performance XML parser
Steven Schveighoffer
schveiguy at yahoo.com
Wed Feb 9 05:37:56 PST 2011
On Tue, 08 Feb 2011 19:16:37 -0500, Tomek Sowiński <just at ask.me> wrote:
> Steven Schveighoffer wrote:
>
>> > The design I'm thinking is that the node iterator will own a buffer. One
>> > consequence is that the fields of the current node will point to the
>> > buffer akin to foreach(line; File.byLine), so in order to lift the input
>> > the user will have to dup (or process the node in-place). As new nodes
>> > will be overwritten on the same piece of memory, an important trait of
>> > the design emerges: cache intensity. Because of XML namespaces I think
>> > it is necessary for the buffer to contain the current node plus all its
>> > parents.
>>
>> That might not scale well. For instance, if you are accessing the 1500th
>> child element of a parent, doesn't that mean that the buffer must contain
>> the full text for the previous 1499 elements in order to also contain the
>> parent?
>>
>> Maybe I'm misunderstanding what you mean.
>
> Let's talk on an example:
>
> <a name="value">
>   <b>
>     Some Text 1
>     <c2> <!-- HERE -->
>       Some text 2
>     </c2>
>     Some Text 3
>   </b>
> </a>
>
> The buffer of the iterator positioned HERE would be:
>
> [Node a | Node b | Node c2]
OK, so you mean a buffer other than the I/O buffer. This means double
buffering data. I was thinking of a solution that allows simply using the
I/O buffer for parsing. I think this is one of the keys to Tango's XML
performance.
-Steve
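
A rough sketch of the two ideas discussed above, for illustration only. It
is written in Python rather than D, and everything in it is an assumption
made for the example: the function name iter_open_paths, the DOC constant,
and the deliberately tiny regex scanner, which is not a conforming XML
parser (it ignores CDATA, processing instructions, and '>' inside attribute
values). It keeps a stack of the open elements, like the
[Node a | Node b | Node c2] layout, but each entry is a memoryview slice of
the input buffer rather than a copy, which is roughly what parsing straight
out of the I/O buffer would look like. The stack is reused between
iterations, so a caller who wants to keep a path has to copy it, much like
dup-ing a line obtained from File.byLine.

```python
import re

# Hypothetical, simplified tag scanner: start tags, end tags, self-closing tags.
_TAG = re.compile(rb"<(/?)([A-Za-z_][\w:.-]*)[^<>]*?(/?)>")


def iter_open_paths(buf):
    """Yield the stack of open element names at every start tag.

    Each name is a memoryview slice of `buf`, so no bytes are copied.
    The same list object is yielded every time and mutated afterwards;
    copy it (and the names) if it must outlive the current iteration.
    """
    view = memoryview(buf)
    stack = []
    for m in _TAG.finditer(buf):
        closing, self_closing = m.group(1), m.group(3)
        if closing:
            stack.pop()                      # leaving an element
            continue
        start, end = m.span(2)
        stack.append(view[start:end])        # slice of the buffer, not a copy
        yield stack
        if self_closing:
            stack.pop()                      # <x/> opens and closes at once


# The example document from the message above.
DOC = (
    b'<a name="value">\n'
    b'  <b>\n'
    b'    Some Text 1\n'
    b'    <c2> <!-- HERE -->\n'
    b'      Some text 2\n'
    b'    </c2>\n'
    b'    Some Text 3\n'
    b'  </b>\n'
    b'</a>\n'
)

for path in iter_open_paths(DOC):
    # Copy each name out of the buffer only when it is actually needed.
    print(b"/".join(bytes(name) for name in path).decode())
```

In this sketch the only per-node state is a pair of offsets for each open
element, so the ancestors stay cheap to track. It sidesteps the hard part of
a real streaming parser, though: the whole document is already in memory
here, whereas with a refillable I/O buffer the slices for ancestors would be
invalidated once their bytes are overwritten.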