The XML module in Phobos

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Fri Jul 31 09:03:07 PDT 2009


Daniel Keep wrote:
> Andrei Alexandrescu wrote:
>> Interesting. Could you please give more details about this? Why is
>> range-based I/O a bad idea, and what can we do to make it a better one?
> 
> (A clarification: I *should* have said "...basing IO entirely on ranges
> is -probably- a bad idea".)
> 
> <rambling>
> 
> My concern is the interface.
> 
> Let's take a hypothetical input range that reads from a file.  Since
> we're parsing XML, we want it to be character data.  So the interface
> might look something like:
> 
> struct Stream(T)
> {
>     T front();
>     bool empty();
>     void next();
> }
> 
> (I realise I probably got at least one name wrong; I can't be bothered
> digging up the exact names, and it's irrelevant anyway :P)

Yah, we had to choose popFront instead of the shorter next because there 
was no obvious corresponding "txen" to extract the last element.

> My concern is that front returns T: a single character.
> 
> I wrote an archival tool many, many years ago in VB.  It worked by
> reading and writing a single byte at a time, and naturally performed
> shockingly.  I knew there had to be a faster way since other programs
> didn't crawl like mine was and discovered that reading/writing in larger
> blocks gave significantly better performance. [1]

I see, and I'm glad to dissipate this concern. There are three 
interfaces that Phobos will define: byChar, byLine, and byBlock. So you 
get to choose the transfer unit and transfer mechanism. (byLine allows 
you to choose the separator too.) Nowadays I use text files often so I 
use byLine. It's very rare that you want to process input one character 
at a time, and indeed it would suck if the infrastructure insisted 
that that be the unit of transfer.
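
A minimal sketch of choosing the transfer unit, assuming the `byLine` and `byChunk` members of `std.stdio.File` found in later Phobos releases. Treating `byChunk` as the counterpart of the byBlock named above is an assumption, and the helper names and the scratch file are hypothetical:

```d
import std.file : remove, tempDir, write;
import std.path : buildPath;
import std.stdio : File, writeln;

// Line at a time: each element is one line, with the terminator stripped.
size_t countLines(string path)
{
    size_t n;
    foreach (line; File(path).byLine())
        ++n;
    return n;
}

// Block at a time: each element is a ubyte[] of at most blockSize bytes.
size_t countBytes(string path, size_t blockSize)
{
    size_t n;
    foreach (chunk; File(path).byChunk(blockSize))
        n += chunk.length;
    return n;
}

void main()
{
    // Hypothetical scratch file, created only to keep the sketch self-contained.
    auto path = buildPath(tempDir(), "range_io_sketch.txt");
    write(path, "alpha\nbeta\ngamma\n");
    scope (exit) remove(path);

    writeln(countLines(path), " lines, ", countBytes(path, 4096), " bytes");
}
```

The loop body is the same in both helpers; only the range picked at the top decides how much data moves per step.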

> Much of the performance of Tango's IO system (and from the XML parsing
> code, too) is that it operates on big arrays wherever it can.  Hell, the
> pull parser is, as far as anyone is able to tell, faster than every
> other XML parser in existence specifically because it reads the whole
> file in one IO operation and then just deals with slices and array access.

(That's great, but isn't the file sometimes a socket stream?)

I don't see this approach clashing with ranges: arrays are ranges, 
so this setup is very natural to implement with ranges.
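
A small sketch of that point, assuming the array range primitives in `std.range.primitives` from later Phobos releases. The function name `countByte` and the sample buffer are hypothetical; the buffer stands in for a file read in one operation:

```d
import std.range.primitives : empty, front, isInputRange, popFront;
import std.string : representation;

// Counts occurrences of needle in any input range of bytes.
size_t countByte(R)(R r, ubyte needle)
    if (isInputRange!R)
{
    size_t n;
    for (; !r.empty; r.popFront())
        if (r.front == needle)
            ++n;
    return n;
}

void main()
{
    // Stand-in for a buffer filled by a single read of the whole file.
    immutable(ubyte)[] buffer = "<a><b/></a>".representation;

    // A slice is itself a range, so no wrapper and no copy are needed.
    assert(countByte(buffer, cast(ubyte) '<') == 3);
    assert(countByte(buffer[3 .. $], cast(ubyte) '<') == 2);
}
```

The generic function never learns that it was handed an array, yet for an array the three primitives reduce to a length check, an index, and a slice.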

> That's one half of my worry with this: that the range interface
> specifically precludes efficient batch operations.

Hope this went away.

> Another, somewhat smaller concern, is that the range interface is
> back-to-front for IO.
> 
> Consider a stream: you don't know if the stream is empty until you
> attempt to read past the end of it.  Standard input does this, network
> sockets do this... probably others.
> 
> But the range interface asks "is this empty?", which you can't answer
> until you attempt to read from it.  So to implement .empty for a
> hypothetical stdin range, you'd need to try reading past the current
> location.  If you get a character, you've just modified the underlying
> stream.

Yah, however note that if you subsequently copy the range, the 
already-read front is also copied so there's no loss. Problems appear if 
you create e.g. two input ranges from the same FILE* or socket or whatnot.

Walter and I discussed this problem for a long time. I also discussed 
the problem in the newsgroup. I argued that the simplest and most 
natural interface for a pure input stream has only one function getNext 
which at the same time gets the element and bumps the stream. 
Unfortunately, since all forward ranges are also input ranges, that 
interface must also work well for all other ranges (e.g. arrays), in 
which case it would be contorted. We decided to define what we now have.
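
To make the trade-off concrete, here is a hedged sketch of wrapping a getNext-style source in the empty/front/popFront interface by caching one element. The adapter `FromGetNext` and the use of `Nullable` to signal exhaustion are assumptions made for illustration, not anything Phobos defines:

```d
import std.range.primitives : isInputRange;
import std.typecons : Nullable;

// Hypothetical adapter: wraps a one-call "get next" source, which returns
// a null value once exhausted, into an input range.
struct FromGetNext(T)
{
    private Nullable!T delegate() getNext;
    private Nullable!T cached;

    this(Nullable!T delegate() getNext)
    {
        this.getNext = getNext;
        cached = getNext(); // the first read happens here, before anyone asks
    }

    @property bool empty() const { return cached.isNull; }
    @property T front() { return cached.get; }
    void popFront() { cached = getNext(); }
}

static assert(isInputRange!(FromGetNext!int));

void main()
{
    int[] source = [10, 20, 30];
    size_t reads; // counts calls that touch the underlying source

    Nullable!int next()
    {
        ++reads;
        if (source.length == 0)
            return Nullable!int.init;
        auto value = source[0];
        source = source[1 .. $];
        return Nullable!int(value);
    }

    auto r = FromGetNext!int(&next);
    assert(reads == 1); // one element was consumed before empty was called

    auto copy = r; // the copy carries the cached front along
    assert(copy.front == 10);

    int[] seen;
    for (; !r.empty; r.popFront())
        seen ~= r.front;
    assert(seen == [10, 20, 30]);
}
```

The cached element is what lets empty answer without losing data. The cost is that the source is read one step ahead of the consumer, and two adapters built over the same source would each swallow elements the other never sees.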

> (Actually, this is more of a concern for me in any situation where
> computing the next element of a range is an expensive operation, or an
> operation with side-effects.  I had the same issue when attempting to
> bind coroutines to the opApply interface.  You had to eagerly compute
> the next value in order to answer the question: is there a next element?)

Yah, but you can always cache the result of the computation. The 
remaining annoyance is that the side effect occurs earlier than you'd 
expect.

> Maybe these won't turn out to be problems in practice.  But my gut
> feeling is that IO would be better served by a Tango-style interface
> (putting the emphasis on efficient block transfers), with ranges
> wrapping that if you're willing to maybe take a performance hit.

I think we can do better by defining a general interface that will work 
for arrays as well as hand-written code.

The bottom line is that ranges and block transfer are not at all in conflict.



Andrei


