High performance XML parser
Michel Fortin
michel.fortin at michelf.com
Fri Feb 4 13:36:58 PST 2011
On 2011-02-04 16:02:39 -0500, Tomek Sowiński <just at ask.me> said:
> I am now intensely accumulating information on how to go about creating
> a high-performance parser as it quickly became clear that my old one
> won't deliver. And if anything is clear, it's that memory is the key.
>
> One way is the slicing approach mentioned on this NG, notably used by
> RapidXML. I already contacted Marcin (the author) to ensure that using
> solutions inspired by his lib is OK with him; it is. But I don't think
> I'll go this way. One reason is, surprisingly, performance. RapidXML
> cannot start parsing until the entire document is loaded and ready as a
> random-access string. Then it's blazingly fast but the time for I/O has
> already elapsed. Besides, as Marcin himself said, we need a 100%
> W3C-compliant implementation and RapidXML isn't one.
>
> I think a much more fertile approach is to operate on a forward range,
> perhaps assuming bufferized input. That way I can start parsing as soon
> as the first buffer gets filled. Not to mention that the end result
> will use much less memory. Plenty of the XML data stream is indents,
> spaces, and markup -- there's no reason to copy all this into memory.
>
> To sum up, I believe memory use and overlapping I/O latency with
> parsing effort are pivotal.
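The overlap Tomek describes, where parsing can begin as soon as the first buffer is filled rather than after the whole document is in memory, could be sketched like this (a hypothetical setup, not an existing std API; the file name is made up; `byChunk` yields an input range of buffers, which `joiner` flattens into a range of bytes):

```d
import std.stdio : File;
import std.algorithm.iteration : joiner;

void main()
{
    // Parsing can start once the first 4 KiB chunk has been read,
    // so I/O latency overlaps with parsing work instead of
    // preceding it.
    auto bytes = File("doc.xml").byChunk(4096).joiner;

    // A range-based parser would consume `bytes` lazily here.
}
```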
I agree it's important, especially when receiving XML over the network,
but I also think it's important to be able to support slicing. Imagine
all the memory you could save by just making slices of
a memory-mapped file.
The difficulty is supporting both models: the input range model, which
requires copying the strings, and the slicing model, where you're just
taking slices of a string.
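One way to reconcile the two models is to select the storage policy at compile time: when the source is a string (a memory-mapped file viewed as one, say), a token can hold a zero-copy slice of it; for a generic character range it has to copy as it consumes. A minimal sketch, assuming a made-up helper name (`textUntil` is not an existing API):

```d
import std.range.primitives : isInputRange, ElementType;
import std.traits : isSomeString;

// Reads characters up to (not including) `delim`, advancing `input`.
// String sources get a slice of the original buffer; other ranges
// get a freshly copied string.
auto textUntil(R)(ref R input, dchar delim)
    if (isSomeString!R || (isInputRange!R && is(ElementType!R : dchar)))
{
    static if (isSomeString!R)
    {
        size_t i;
        while (i < input.length && input[i] != delim) ++i;
        auto result = input[0 .. i];   // zero-copy: aliases the source
        input = input[i .. $];
        return result;
    }
    else
    {
        import std.array : appender;
        auto buf = appender!string();
        while (!input.empty && input.front != delim)
        {
            buf.put(input.front);      // must copy element by element
            input.popFront();
        }
        return buf.data;
    }
}

void main()
{
    // Slicing path: the result aliases the source string.
    string doc = "hello<world";
    auto head = textUntil(doc, '<');
    assert(head == "hello" && doc == "<world");

    // Copying path: a generic character range forces a copy.
    import std.utf : byChar;
    auto r = "hello<world".byChar;
    assert(textUntil(r, '<') == "hello");
}
```

The compile-time branch keeps a single parser front end while letting the string case stay allocation-free.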
--
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/
More information about the Digitalmars-d mailing list