High performance XML parser

Michel Fortin michel.fortin at michelf.com
Fri Feb 4 13:36:58 PST 2011


On 2011-02-04 16:02:39 -0500, Tomek Sowiński <just at ask.me> said:

> I am now intensely accumulating information on how to go about creating 
> a high-performance parser, as it quickly became clear that my old one 
> won't deliver. And if anything is clear, it's that memory is the key.
> 
> One way is the slicing approach mentioned on this NG, notably used by 
> RapidXML. I already contacted Marcin (the author) to ensure that using 
> solutions inspired by his lib is OK with him; it is. But I don't think 
> I'll go this way. One reason is, surprisingly, performance. RapidXML 
> cannot start parsing until the entire document is loaded and ready as a 
> random-access string. Then it's blazingly fast but the time for I/O has 
> already elapsed. Besides, as Marcin himself said, we need a 100% 
> W3C-compliant implementation and RapidXML isn't one.
> 
> I think a much more fertile approach is to operate on a forward range, 
> perhaps assuming buffered input. That way I can start parsing as soon 
> as the first buffer gets filled. Not to mention that the end result 
> will use much less memory. Plenty of the XML data stream is indents, 
> spaces, and markup -- there's no reason to copy all this into memory.
> 
> To sum up, I believe memory use, and overlapping I/O latency with 
> parsing effort, are pivotal.

I agree it's important, especially when receiving XML over the network, 
but I also think it's important to support slicing. Imagine all the 
memory you could save by just making slices of a memory-mapped file.
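The slicing idea can be sketched roughly like this (C++ for illustration, since RapidXML is a C++ library; `elementNames` is a made-up helper, not any real API): every returned `string_view` points back into the original buffer -- which could be a memory-mapped file -- so no element name is ever copied.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical sketch: scan a buffer (e.g. a memory-mapped XML document)
// and collect the name of each start tag as a string_view. Each view is a
// slice of `doc` itself -- no text is copied, no per-name allocation.
std::vector<std::string_view> elementNames(std::string_view doc) {
    std::vector<std::string_view> names;
    for (std::size_t i = 0; i < doc.size(); ++i) {
        // a start tag begins with '<' not followed by '/', '?' or '!'
        if (doc[i] == '<' && i + 1 < doc.size() &&
            doc[i + 1] != '/' && doc[i + 1] != '?' && doc[i + 1] != '!') {
            std::size_t start = i + 1;
            std::size_t end = doc.find_first_of(" \t\r\n/>", start);
            if (end == std::string_view::npos) break;
            names.push_back(doc.substr(start, end - start)); // slice, not copy
            i = end;
        }
    }
    return names;
}
```

Calling `elementNames` on `"<root><item id=\"1\"/><item/></root>"` yields three views -- `root`, `item`, `item` -- whose `data()` pointers all lie inside the caller's buffer, which is exactly the memory saving at stake.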

The difficulty is supporting both models: the input range model, which 
requires copying the strings, and the slicing model, where you're just 
taking slices of a string.
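One way the two models could be reconciled is to write the scanning logic once against a small source interface, then plug in either a slicing source or a copying stream source. A minimal sketch, with all names invented for illustration (C++ again):

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <string_view>

// Slicing model: the whole document sits in one buffer, so a finished
// token is just a view between two positions -- O(1), no allocation.
struct SliceSource {
    using Text = std::string_view;
    std::string_view buf;
    std::size_t pos = 0, mark = 0;
    bool empty() const { return pos >= buf.size(); }
    char front() const { return buf[pos]; }
    void popFront() { ++pos; }
    void beginToken() { mark = pos; }
    void noteChar() {}                       // nothing to do: we slice later
    Text endToken() const { return buf.substr(mark, pos - mark); }
};

// Input range model: forward-only stream, so any text the parser wants
// to keep must be copied out character by character as it goes past.
struct StreamSource {
    using Text = std::string;
    std::istream& in;
    int cur;
    std::string token;
    explicit StreamSource(std::istream& s) : in(s), cur(s.get()) {}
    bool empty() const { return cur == std::char_traits<char>::eof(); }
    char front() const { return static_cast<char>(cur); }
    void popFront() { cur = in.get(); }
    void beginToken() { token.clear(); }
    void noteChar() { token.push_back(front()); } // the unavoidable copy
    Text endToken() { return token; }
};

// Written once, works with both sources: returns the first element name.
template <typename Source>
typename Source::Text firstElementName(Source& src) {
    while (!src.empty() && src.front() != '<') src.popFront();
    if (!src.empty()) src.popFront();             // consume '<'
    src.beginToken();
    while (!src.empty() && src.front() != '>' &&
           src.front() != ' ' && src.front() != '/') {
        src.noteChar();
        src.popFront();
    }
    return src.endToken();
}
```

The cost difference is concentrated in `noteChar`/`endToken`: the slicing source pays nothing until it hands back a view, while the stream source pays one copy per retained character -- which is the price of being able to start before the whole document has arrived.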


-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/



More information about the Digitalmars-d mailing list