The XML module in Phobos

Daniel Keep daniel.keep.lists at gmail.com
Thu Jul 30 23:26:34 PDT 2009


Andrei Alexandrescu wrote:
> Daniel Keep wrote:
>> ...

>> Of course, most people HATE this method because it requires you to write
>> mountains of boilerplate code.  Pity, then, it's also the fastest and
>> most flexible.  :P  (It's a pity D doesn't have extension methods since
>> then you could probably do something along the lines of LINQ to make the
>> whole thing utterly painless... but then, I've given up on waiting for
>> that.)
>>
>> This is basically the only way to map xml parsing to ranges.  As for
>> CONSUMING ranges, I think that'd be a bad idea for the same reason
>> basing IO entirely on ranges is a bad idea.
> 
> Interesting. Could you please give more details about this? Why is
> range-based I/O a bad idea, and what can we do to make it a better one?

(A clarification: I *should* have said "...basing IO entirely on ranges
is -probably- a bad idea".)

<rambling>

My concern is the interface.

Let's take a hypothetical input range that reads from a file.  Since
we're parsing XML, we want it to be character data.  So the interface
might look something like:

struct Stream(T)
{
    T front();
    bool empty();
    void popFront();
}

(Those should be the range primitive names -- front, empty, popFront --
but even if I've got one wrong, it's irrelevant to the point anyway :P)

My concern is that front returns T: a single character.

I wrote an archival tool many, many years ago in VB.  It worked by
reading and writing a single byte at a time, and naturally performed
shockingly.  I knew there had to be a faster way, since other programs
didn't crawl like mine did, and discovered that reading/writing in
larger blocks gave significantly better performance. [1]

Much of the performance of Tango's IO system (and of its XML parsing
code, too) comes from operating on big arrays wherever it can.  Hell,
the pull parser is, as far as anyone can tell, faster than every other
XML parser in existence specifically because it reads the whole file in
one IO operation and then just deals with slices and array accesses.

That's one half of my worry with this: that the range interface
specifically precludes efficient batch operations.
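(To be fair, nothing stops a range from handing out slices instead of
single elements -- front just has to return an array.  A minimal sketch
of the idea; ChunkedStream and its in-memory "file" are invented here
for illustration, not anything in Phobos or Tango:

```d
/// Hypothetical input range whose front yields a slice of up to
/// chunkSize elements, so each step amortises one underlying read.
struct ChunkedStream
{
    const(char)[] data;   // stands in for a file's contents
    size_t chunkSize;

    bool empty() const { return data.length == 0; }

    const(char)[] front() const
    {
        auto n = chunkSize < data.length ? chunkSize : data.length;
        return data[0 .. n];
    }

    void popFront()
    {
        auto n = chunkSize < data.length ? chunkSize : data.length;
        data = data[n .. $];
    }
}

void main()
{
    auto s = ChunkedStream("<root><a/></root>", 8);
    assert(s.front == "<root><a");   // a whole block, not one char
    size_t total;
    foreach (chunk; s) total += chunk.length;
    assert(total == 17);
}
```

But then every consumer has to be written against chunks rather than
elements, which is exactly the boilerplate problem again.)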

Another, somewhat smaller concern, is that the range interface is
back-to-front for IO.

Consider a stream: you don't know if the stream is empty until you
attempt to read past the end of it.  Standard input does this, network
sockets do this... probably others.

But the range interface asks "is this empty?", which you can't answer
until you attempt to read from it.  So to implement .empty for a
hypothetical stdin range, you'd need to try reading past the current
location.  If you get a character, you've just modified the underlying
stream.

(Actually, this is more of a concern for me in any situation where
computing the next element of a range is an expensive operation, or an
operation with side-effects.  I had the same issue when attempting to
bind coroutines to the opApply interface.  You had to eagerly compute
the next value in order to answer the question: is there a next element?)
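Concretely, the only way to implement such a range is to buffer one
element of lookahead and fetch eagerly.  A sketch, where the delegate
stands in for a blocking read from stdin or a socket (all names here
are made up for illustration):

```d
/// Hypothetical range over a stream whose end is only discovered by
/// reading past it: empty forces an eager read-ahead into a buffer.
struct LookaheadRange
{
    bool delegate(out char c) source;  // false once exhausted
    char buffered;
    bool done;

    this(bool delegate(out char) src)
    {
        source = src;
        done = !source(buffered);  // side effect: one read up front
    }

    bool empty() const { return done; }
    char front() const { return buffered; }
    void popFront() { done = !source(buffered); }
}

void main()
{
    string input = "xml";
    size_t i;
    bool next(out char c)
    {
        if (i >= input.length) return false;
        c = input[i++];
        return true;
    }

    auto r = LookaheadRange(&next);
    // Merely constructing the range has already consumed one
    // character from the underlying stream -- the side effect in
    // question.
    assert(i == 1);

    char[] got;
    foreach (c; r) got ~= c;
    assert(got == "xml");
}
```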

Maybe these won't turn out to be problems in practice.  But my gut
feeling is that IO would be better served by a Tango-style interface
(putting the emphasis on efficient block transfers), with ranges
wrapping that if you're willing to maybe take a performance hit.
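By which I mean a layering something like the following sketch -- a
block interface underneath, with a range adapter on top for code that
wants the convenience.  None of these names are an actual Tango or
Phobos API; they're illustrative only:

```d
/// Tango-style block interface: read fills a caller-supplied buffer
/// and reports how much it wrote; 0 means end of stream.
interface InputStream
{
    size_t read(ubyte[] buf);
}

/// In-memory stand-in for a real file/socket stream.
class MemoryStream : InputStream
{
    ubyte[] data;
    this(ubyte[] d) { data = d; }
    size_t read(ubyte[] buf)
    {
        auto n = buf.length < data.length ? buf.length : data.length;
        buf[0 .. n] = data[0 .. n];
        data = data[n .. $];
        return n;
    }
}

/// Range adapter: refills an internal buffer in big blocks, then
/// hands out one element at a time (the possible performance hit).
struct ByteRange
{
    InputStream src;
    ubyte[] buf;
    ubyte[] avail;

    this(InputStream s, size_t bufSize = 4096)
    {
        src = s;
        buf = new ubyte[bufSize];
        refill();
    }
    private void refill() { avail = buf[0 .. src.read(buf)]; }

    bool empty() const { return avail.length == 0; }
    ubyte front() const { return avail[0]; }
    void popFront()
    {
        avail = avail[1 .. $];
        if (avail.length == 0) refill();
    }
}

void main()
{
    auto r = ByteRange(new MemoryStream(cast(ubyte[]) "hi".dup), 1);
    ubyte[] got;
    foreach (b; r) got ~= b;
    assert(cast(string) got == "hi");
}
```

Code that cares about throughput talks to InputStream directly and
moves whole buffers; code that doesn't gets the range.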

</rambling>

Just my exceedingly verbose AU$0.02.

> And what's the way that avoids writing boilerplate code but is slower?
> Is that the method that calls virtual functions (or delegates) upon each
> element received?

(Deleted lots of rambling)

The problem with calling a delegate for every element received is that
all the interfaces that do this suck.  SAX is the prime example of this.

Looking at stuff like Rx
(http://themechanicalbride.blogspot.com/2009/07/introducing-rx-linq-to-events.html),
I'm convinced there must be a way of doing it WELL.  I just don't know
what it is yet.

> ...


[1] I learned so much more back then when I had NO idea what I was
doing, and thus made lots of mistakes.  Sadly, I have a strong physical
aversion to making mistakes, so now I don't take risks.  And because I
know I know I don't like taking risks, I can't trick myself into taking
them.  Curse my endlessly recursive consciousness!


