High performance XML parser

Tue Feb 8 17:46:46 PST 2011

On 02/09/2011 01:16 AM, Tomek Sowiński wrote:
> Steven Schveighoffer napisał:
>
>>> The design I'm thinking is that the node iterator will own a buffer. One
>>> consequence is that the fields of the current node will point to the
>>> buffer akin to foreach(line; File.byLine), so in order to lift the input
>>> the user will have to dup (or process the node in-place). As new nodes
>>> will be overwritten on the same piece of memory, an important trait of
>>> the design emerges: cache intensity. Because of XML namespaces I think
>>> it is necessary for the buffer to contain the current node plus all its
>>> parents.
>>
>> That might not scale well.  For instance, if you are accessing the 1500th
>> child element of a parent, doesn't that mean that the buffer must contain
>> the full text for the previous 1499 elements in order to also contain the
>> parent?
>>
>> Maybe I'm misunderstanding what you mean.
>
> Let's talk on an example:
>
> <a name="value">
> 	<b>
> 		Some Text 1
> 		<c2>       <!-- HERE -->
> 		Some text 2
> 		</c2>
> 		Some Text 3
> 	</b>
> </a>
>
> The buffer of the iterator positioned HERE would be:
>
> [Node a | Node b | Node c2]
>
> Node c2 and all its parents are available for inspection. Node a's attribute is stored in the buffer, but not b's "Some text 1" as it is c2's sibling; "Some text 1" was available in the previous iteration, now it's overwritten by c2. To get to "Some text 2" let's advance the iterator in depth to get:
>
> [Node a | Node b | Node c2 | Text node "Some text 2"]
>
> Advancing it once more we get to:
>
> [Node a | Node b | Text node "Some text 3"]
>
> So "Some text 3" is written where c2 and the text node 2 used to be.

That's very similar to what I was thinking at. What I wonder is whether, in 
your buffer representations above, 'Node x' represents an instanciated node, or 
collected data necessary to later instanciate --once the current part of the 
source is validated (proved correct). What i mean is, the whole <a> node will 
be validated only when </a> is matched, so that if your parsing process fails 
on the way (and/or it was just following a wrong parsing path), then all nodes 
instanciated along tha way are just to throw away, aren't they (unless some 
memoisation may be useful).
Possibly all what say here is just stupid, depending on the parsing algo and 
nature of the grammar, and also how costly Node creation is. In my case, all of 
this seems relevant. Thus, I'm thinking at just collecting data along the way 
(rather easy & far cheaper than node construction), and (recursively) 
instanciate only once a section is validated. (needs to be tried concretely 
--maybe there are issues I'm not yet aware of)

Denis
-- 
_________________
vita es estrany
spir.wikidot.com