High performance XML parser

Roman Ivanov isroman.DEL at ETE.km.ru
Mon Feb 7 20:01:30 PST 2011


On 2/4/2011 4:47 PM, Tomek Sowiński wrote:
> Michel Fortin wrote:
> 
>> I agree it's important, especially when receiving XML over the network, 
>> but I also think it's important to be able to support slicing. Imagine 
>> all the memory you could save by just making slices of a memory-mapped 
>> file.
>>
>> The difficulty is to support both models: the input range model which 
>> requires copying the strings and the slicing model where you're just 
>> taking slices of a string.
> 
> These are valid concerns. Yet in the overwhelming majority of cases, XML documents come from the hard drive or the network -- those are the places we need to drill into. I fear that trying to cover every remote use case will render the library incomprehensible.
> 

This reminds me of some things I was thinking about when I worked on
XML-heavy apps in Java and experimented with writing parsers for my own
simple markup languages.

If you have the entire XML string loaded in memory, the most
time-consuming part of parsing it is probably going to be the allocation
of node objects. So it makes sense to do a quick scan of the char array
and generate just the root node, which would lazily allocate its
sub-nodes upon access.
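
Something like this -- a very rough sketch in D, with made-up names
(this is not std.xml or any existing library), and it ignores comments,
CDATA, the prologue, and attribute values containing '<' or '>':

import std.string : indexOf, lastIndexOf;

// An element is just a slice of the original document.  Nothing is
// copied and no children are allocated until children() is first called.
struct LazyElement
{
    string doc;            // slice covering "<tag ...> ... </tag>"
    LazyElement[] kids;
    bool scanned;

    LazyElement[] children()
    {
        if (scanned)
            return kids;
        scanned = true;

        // The element's content lies between the first '>' and the
        // last '<' of its slice.
        auto open  = doc.indexOf('>');
        auto close = doc.lastIndexOf('<');
        if (open < 0 || close <= open)
            return kids;
        string content = doc[open + 1 .. close];

        size_t i = 0;
        while (i < content.length)
        {
            auto lt = content[i .. $].indexOf('<');
            if (lt < 0)
                break;
            size_t start = i + lt;

            // Find the end of this child by tracking nesting depth.
            int depth = 0;
            size_t j = start;
            while (j < content.length)
            {
                if (content[j] == '<')
                    depth += (content[j + 1] == '/') ? -1 : 1;
                else if (content[j] == '>')
                {
                    if (content[j - 1] == '/')
                        depth--;                // self-closing tag
                    if (depth == 0) { j++; break; }
                }
                j++;
            }
            kids ~= LazyElement(content[start .. j]);
            i = j;
        }
        return kids;
    }
}

unittest
{
    auto root = LazyElement("<a><b>x</b><c/></a>");
    assert(root.children().length == 2);        // <b>x</b> and <c/>
}

Only the root element exists up front; everything below it stays a plain
slice of the original buffer until somebody actually walks into it,
which also fits the memory-mapped-file case Michel mentioned.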

I can see several different implementations of a high-performance
parser, depending on the typical use case. Do you want to work
efficiently with lots of small files or one huge file? Deeply nested or
mostly flat? Coming from memory or from a stream of characters?

The problem is that with lazy parsing, XML nodes would need to be able
to call back into the parser that created them.
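
One way to picture it (again just a sketch, with made-up names): the
node only stores offsets into the parser's buffer, so anything it does
later -- reading its text, expanding children, interning strings -- has
to go back through the parser that created it.

// The node keeps a back-reference to its parser, because the parser
// owns the shared state (the document buffer, an intern table, maybe
// an allocator) that lazy expansion needs later.
class Parser
{
    string doc;                 // whole document, or a memory-mapped slice
    string[string] interned;    // shared state nodes cannot carry themselves

    this(string doc) { this.doc = doc; }

    string intern(string s)
    {
        if (auto p = s in interned)
            return *p;
        interned[s] = s;
        return s;
    }
}

class Node
{
    Parser owner;               // the back-reference in question
    size_t start, end;          // this node's region of owner.doc

    this(Parser owner, size_t start, size_t end)
    {
        this.owner = owner;
        this.start = start;
        this.end   = end;
    }

    // Even reading the text requires the parser again: the node itself
    // only knows offsets.
    string text() { return owner.intern(owner.doc[start .. end]); }
}

It works, but it ties the lifetime of every node to its parser and the
buffer behind it.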

Perhaps it would be possible to specify some kind of generic XML node
interface and allow people to use/generate different implementations
depending on what they need?
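
Something along these lines, with completely made-up names (nothing to
do with std.xml): user code is written against the interface, and an
eager DOM-style implementation, a lazy slice-backed one, or a streaming
one can be plugged in underneath.

interface XmlNode
{
    string name();
    string text();
    XmlNode[] children();
}

// The simplest possible implementation: everything built eagerly.
class EagerNode : XmlNode
{
    private string name_, text_;
    private XmlNode[] kids_;

    this(string name, string text, XmlNode[] kids = null)
    {
        name_ = name;
        text_ = text;
        kids_ = kids;
    }

    string name()        { return name_; }
    string text()        { return text_; }
    XmlNode[] children() { return kids_; }
}

// A lazy, slice-backed class (like the sketch above) or a pull-parser
// wrapper would implement the same interface, so callers never need to
// know which variant they are holding.

The cost is a virtual call per access, which is probably acceptable if
the expensive part is allocation and I/O rather than traversal.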

