High performance XML parser

Jacob Carlborg doob at me.com
Sun Feb 6 02:48:09 PST 2011


On 2011-02-04 22:02, Tomek Sowiński wrote:
> I am now intensely accumulating information on how to go about creating a high-performance parser, as it quickly became clear that my old one won't deliver. And if anything is clear, it is that memory is the key.
>
> One way is the slicing approach mentioned on this NG, notably used by RapidXML. I already contacted Marcin (the author) to ensure that using solutions inspired by his lib is OK with him; it is. But I don't think I'll go this way. One reason is, surprisingly, performance. RapidXML cannot start parsing until the entire document is loaded and ready as a random-access string. Then it's blazingly fast, but the time for I/O has already elapsed. Besides, as Marcin himself said, we need a 100% W3C-compliant implementation and RapidXML isn't one.
>
> I think a much more fertile approach is to operate on a forward range, perhaps assuming buffered input. That way I can start parsing as soon as the first buffer is filled. Not to mention that the end result will use much less memory. Much of the XML data stream is indentation, spaces, and markup -- there's no reason to copy all of that into memory.
>
> To sum up, I believe that memory and overlapping I/O latency with parsing effort are pivotal.
>
> Please comment on this.
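
For reference, the slicing technique mentioned above looks roughly like
this in D (a hypothetical sketch with made-up names, nothing like
RapidXML's actual C++ internals): the whole document has to be in memory
as a random-access string, and instead of copying names and values the
parser hands out slices of that buffer.

string readTagName(string doc, ref size_t pos)
{
    assert(doc[pos] == '<');
    ++pos;                          // skip '<'
    immutable start = pos;
    while (pos < doc.length && doc[pos] != '>' && doc[pos] != ' ')
        ++pos;
    return doc[start .. pos];       // a slice of the original buffer, no copy
}

void main()
{
    import std.stdio : writeln;

    string doc = `<root><child attr="1"/></root>`;
    size_t pos = 0;
    writeln(readTagName(doc, pos)); // prints "root"
}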

I don't think it's up to the parser to decide where the content comes 
from. It should also be able to handle the whole content of an XML file 
that is already in memory.
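
If the parsing routines are templated on a forward range, both cases fall
out naturally: the caller decides whether to hand the parser a string that
is already entirely in memory or a buffered range that is filled as data
arrives. A minimal sketch (countStartTags and its constraint are made up
for illustration, not an existing API):

import std.range.primitives;

// Counts every '<' that does not open an end tag; declarations, comments
// and CDATA are not distinguished in this sketch.
size_t countStartTags(R)(R input)
    if (isForwardRange!R && is(ElementType!R : dchar))
{
    size_t count;
    while (!input.empty)
    {
        if (input.front == '<')
        {
            input.popFront();
            if (!input.empty && input.front != '/')
                ++count;            // a start tag such as <root> or <child/>
        }
        else
            input.popFront();       // character data, whitespace, markup tails
    }
    return count;
}

void main()
{
    import std.stdio : writeln;

    // The caller decides the source: here the whole document is a string
    // in memory, but any buffered range of characters would do as well.
    writeln(countStartTags(`<root><child attr="1"/></root>`)); // prints 2
}

A plain string satisfies the constraint, and a range over buffered input
would too, so the choice of source stays with the caller.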

-- 
/Jacob Carlborg

