std.xml: Why is it so slow? Is there anything else wrong with it?

Sun Mar 13 01:18:57 PST 2011

On Sunday 13 March 2011 01:11:05 Russel Winder wrote:
> On Sat, 2011-03-12 at 23:34 -0500, dsimcha wrote:
> > There seems to be a consensus around here that Phobos needs a good XML
> > module, and that std.xml doesn't cut it, at least partly due to
> > performance issues.  I have no clue how to write a good XML module from
> > scratch.  It seems like no one else is taking up the project either.
> 
> I just worry that creating a whole self-standing library is a waste of
> time when wrapping libxml2 and libxslt gets a fast XML subsystem for
> free.  This is the direction Python has gone; cf. the lxml package to
> replace ElementTree.  The elephant in the room is of course W3C DOM.
> Everyone believes they have to have an implementation, but no-one then
> uses it.
> 
> > This leads me to two questions:
> > 
> > 1.  Has anyone ever sat down and tried to figure out **why** std.xml is
> > so slow?  Seriously, if no one's bothered to profile it or read the code
> > carefully, then for all we know there might be some low hanging fruit
> > and it might be an afternoon of optimization away from being reasonably
> > fast.  Basically every experience I've ever had suggests that, if a
> > piece of code has not already been profiled and heavily optimized, at
> > least a 5-fold speedup can almost always be obtained just by optimizing
> > the low-hanging fruit.  (For example, see my recent pull request for the
> > D garbage collector.  BTW, if excessive allocations are a contributing
> > factor, then fixing the GC should help with XML, too.)
> > 
> > If the answer is no, this hasn't been done, please post some canned
> > benchmarks and maybe I'll take a crack at it.
> > 
> > 2.  What other major defects/design flaws, if any, does std.xml have?
> > 
> > In other words, how are we really so sure that we need to start from
> > scratch?
> 
> Excellent question.  Especially given the existence of libxml2 and
> libxslt.

Well, Tom is working on a new std.xml regardless, but I would fully expect a
properly implemented XML library in D to cream something like libxml. D's
slicing abilities give it a _huge_ advantage when it comes to stuff like parsing. 
libxml isn't going to be able to take advantage of that. Tango's XML parser is 
_extremely_ fast
( http://dotnot.org/blog/archives/2008/03/12/why-is-dtango-so-fast-at-parsing-xml/ ),
and one of the biggest reasons for that is D's slicing
abilities. Parsing is one place where D should be able to seriously shine and is 
_definitely_ one of the places where we _don't_ want to wrap a C library if we 
don't have to.
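
As a rough illustration of the slicing point, here is a minimal sketch of a
tokenizer whose tokens are slices of the input buffer rather than copies. It is
not code from Tango or from the planned std.xml; `Token` and `tokenize` are
names made up for this example, and it deliberately ignores attributes,
comments, CDATA and entities.

```d
import std.string : indexOf;

/// Hypothetical token type for this sketch: a tag or a run of text.
struct Token
{
    bool isTag;
    const(char)[] text; // a slice of the input buffer, not a copy
}

/// Splits `input` into tag and text tokens. Each token's `text` is a slice
/// of `input`, so no character data is copied; only the token array itself
/// is allocated.
Token[] tokenize(const(char)[] input)
{
    Token[] tokens;
    while (input.length)
    {
        if (input[0] == '<')
        {
            auto end = input.indexOf('>');
            if (end < 0)
                break; // unterminated tag: stop (a real parser would report an error)
            tokens ~= Token(true, input[1 .. end]); // contents between '<' and '>'
            input = input[end + 1 .. $];            // advance by re-slicing
        }
        else
        {
            auto next = input.indexOf('<');
            size_t stop = next < 0 ? input.length : cast(size_t) next;
            tokens ~= Token(false, input[0 .. stop]);
            input = input[stop .. $];
        }
    }
    return tokens;
}
```

Calling `tokenize("<a>hi</a>")` yields three tokens whose `text` fields all
point into the original string. A C library generally has to copy each piece
or NUL-terminate it in place to hand back an ordinary C string.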

But regardless, a new std.xml is in the works, and hopefully it'll be up for 
review within the next couple of months (I have no idea how fast Tom is making 
progress on it though).

- Jonathan M Davis