std.xml and Adam D Ruppe's dom module

Wed Feb 8 00:12:57 PST 2012

On Tue, 07 Feb 2012 20:44:08 -0500,
"Jonathan M Davis" <jmdavisProg at gmx.com> wrote:

> On Tuesday, February 07, 2012 00:56:40 Adam D. Ruppe wrote:
> > On Monday, 6 February 2012 at 23:47:08 UTC, Jonathan M Davis
> > wrote:
> > > Also, two of the major requirements for an improved std.xml are
> > > that it needs to have a range-based API, and it needs to be
> > > fast.
> > 
> > What does range based API mean in this context? I do offer
> > a couple ranges over the tree, but it really isn't the main
> > thing there.
> > 
> > Check out Element.tree() for the main one.
> > 
> > 
> > But, if you mean taking a range for input, no, doesn't
> > do that. I've been thinking about rewriting the parse
> > function (if you look at it, you'll probably hate it
> > too!). But, what I have works and is tested on a variety
> > of input, including garbage that was a pain to get working
> > right, so I'm in no rush to change it.
> > 
> > > Tango's XML parser has pretty much set the bar on speed
> > 
> > Yeah, I'm pretty sure Tango whips me hard on speed. I spent
> > some time in the profiler a month or two ago and got a
> > significant speedup over the datasets I use (html files),
> > but I'm sure there's a whole lot more that could be done.
> > 
> > 
> > 
> > The biggest thing is I don't think you could use my parse
> > function as a stream.
> 
> Ideally, std.xml would operate on ranges of dchar (but obviously be
> optimized for strings, since there are lots of optimizations that can
> be done with string processing - at least as far as unicode goes) and
> it would return a range of some kind. The result would probably be a
> document type of some kind which provided a range of its top level
> nodes (or maybe just the root node) which each then provided ranges
> over their sub-nodes, etc. At least, that's the kind of thing that I
> would expect. Other calls on the document and nodes may not be
> range-based at all (e.g. xpaths should probably be supported, and
> that doesn't necessarily involve ranges). The best way to handle it
> all would probably depend on the implementation. I haven't
> implemented a full-blown XML parser, so I don't know what the best
> way to go about it would be, but ideally, you'd be able to process
> the nodes as a range.
> 
> - Jonathan M Davis
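
To make the quoted idea concrete, here is a minimal sketch of a node type whose sub-nodes are exposed as a range. Everything in it is an assumption made for illustration: Node, kids and children are hypothetical names, not part of std.xml or of Adam's dom module, and a real design would likely return a lazy range rather than an array slice.

```d
import std.algorithm : map;
import std.array : array;

/// Hypothetical node type (an assumption, not an existing API).
final class Node
{
    string name;
    private Node[] kids;

    this(string name, Node[] kids = null)
    {
        this.name = name;
        this.kids = kids;
    }

    /// A D slice is already a random-access range, so returning one
    /// lets callers use foreach and std.algorithm on the sub-nodes.
    @property Node[] children() { return kids; }
}

unittest
{
    auto root = new Node("html", [
        new Node("head"),
        new Node("body", [new Node("p"), new Node("p")]),
    ]);

    // Walk the top-level nodes as a range and collect their names.
    auto names = root.children.map!(n => n.name).array;
    assert(names == ["head", "body"]);

    // Each node in turn provides a range over its own sub-nodes.
    assert(root.children[1].children.length == 2);
}
```

Calls such as xpath lookups would sit next to this as ordinary member functions and need not be range-based at all.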

Using ranges of dchar directly can be horribly inefficient in some
cases; you'll need at least some kind of buffered dchar range. Some
std.json replacement code tried to use only dchar ranges and had to
reassemble strings character by character using Appender. That sucks,
especially if you're only interested in a small part of the data and
don't care about the rest.

So for pull/SAX parsers: use buffering and return strings (better:
w/d/char[]) as slices of that buffer. If the user needs to keep a
string, he can still copy it. (String decoding should also be done
on demand only.)
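
A rough sketch of what "return slices of the buffer" could look like, under some loud assumptions: the whole input already sits in one in-memory buffer (a real streaming parser would refill the buffer, and slices would only stay valid until the next refill), and the "XML" handled here is just tags and text with no attributes, comments, CDATA or entities. TokenKind, Token and SliceTokenizer are hypothetical names, not an existing Phobos API.

```d
import std.string : indexOf;

enum TokenKind { tag, text }

struct Token
{
    TokenKind kind;
    const(char)[] raw; // slice of the caller's buffer, not a copy
}

/// Hypothetical pull tokenizer: an input range of Tokens.
struct SliceTokenizer
{
    private const(char)[] rest; // unconsumed part of the buffer
    private Token current;
    private bool done;

    this(const(char)[] buffer)
    {
        rest = buffer;
        popFront(); // prime the first token
    }

    @property bool empty() { return done; }
    @property Token front() { return current; }

    void popFront()
    {
        if (rest.length == 0)
        {
            done = true;
            return;
        }
        if (rest[0] == '<')
        {
            // A tag runs up to and including the next '>'.
            auto end = rest.indexOf('>');
            auto len = end < 0 ? rest.length : cast(size_t)(end + 1);
            current = Token(TokenKind.tag, rest[0 .. len]);
            rest = rest[len .. $];
        }
        else
        {
            // Text runs up to, but not including, the next '<'.
            auto end = rest.indexOf('<');
            auto len = end < 0 ? rest.length : cast(size_t) end;
            current = Token(TokenKind.text, rest[0 .. len]);
            rest = rest[len .. $];
        }
    }
}

unittest
{
    string doc = "<a>hello</a>";
    string kept;

    foreach (tok; SliceTokenizer(doc))
    {
        if (tok.kind == TokenKind.text)
        {
            // No allocation so far: raw points into doc itself.
            assert(tok.raw.ptr == doc.ptr + 3);
            // Only the caller who wants to keep the data pays for a copy.
            kept = tok.raw.idup;
        }
    }
    assert(kept == "hello");
}
```

Nothing is decoded or appended character by character here; a user who skips most tokens never touches their contents, and decoding to dchar could be applied later to just the slices that matter.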