buffered input

Sat Feb 5 02:09:09 PST 2011

On Friday 04 February 2011 21:46:40 Andrei Alexandrescu wrote:
> I've had the opportunity today to put some solid hours of thinking into
> the relationship (better said the relatedness) of what would be called
> "buffered streams" and ranges. They have some commonalities and some
> differences, but it's been difficult to capture them. I think now I have
> a clear view, caused by a few recent discussions. One was the CSV reader
> discussed on the Phobos list; another was the discussion on defining the
> "right" std.xml.
> 
> First, let's start with the humblest abstraction of all - an input
> range, which only defines the troika empty/front/popFront with the known
> semantics.
> 
> An input range consumes input destructively and has a one-element
> horizon. It may as well considered a buffered stream with the buffer
> length exactly one.
> 
> Going from there, we may say that certain streaming can be done by using
> an input range of ubyte (or dchar for text). That would be the
> UTFpowered equivalent of getchar(). The readf function operates that way
> - it only needs to look one character ahead. Incidentally, the CSV
> format also requires lookahead of 1, so it also can operate on a range
> of dchar.
> 
> At this point we need to ask ourselves an essential question. Since we
> have this "input range" abstraction for a 1-element buffer, what would
> its n-elements buffer representation look like? How do we go from "input
> range of T" (which really is "unbuffered input range of T" to "buffered
> input range of T"?
> 
> Honestly, the answer was extremely unclear to me for the longest time. I
> thought that such a range would be an extension of the unbuffered one,
> e.g. a range that still offers T from front() but also offers some
> additional functions - e.g. a lookahead in the form of a random-access
> operator. I still think something can be defined along those lines, but
> today I came together with a design that is considerably simpler both
> for the client and the designer of the range.
> 
> I hereby suggest we define "buffered input range of T" any range R that
> satisfies the following conditions:
> 
> 1. R is an input range of T[]
> 
> 2. R defines a primitive shiftFront(size_t n). The semantics of the
> primitive is that, if r.front.length >= n, then shiftFront(n) discards
> the first n elements in r.front. Subsequently r.front will return a
> slice of the remaining elements.
> 
> 3. R defines a primitive appendToFront(size_t n). Semantics: adds at
> most n more elements from the underlying stream and makes them available
> in addition to whatever was in front. For example if r.front.length was
> 1024, after the call r.appendToFront(512) will have r.front have length
> 1536 of which the first 1024 will be the old front and the rest will be
> newly-read elements (assuming that the stream had enough data). If n =
> 0, this instructs the stream to add any number of elements at its own
> discretion.
> 
> This is it. I like many things about this design, although I still fear
> some fatal flaw may be found with it.
> 
> With these primitives a lot of good operating operating on buffered
> streams can be written efficiently. The range is allowed to reuse data
> in its buffers (unless that would contradict language invariants, e.g.
> if T is invariant), so if client code wants to stash away parts of the
> input, it needs to make a copy.
> 
> One great thing is that buffered ranges as defined above play very well
> with both ranges and built-in arrays - two quintessential parts of D. I
> look at this and say, "this all makes sense". For example the design
> could be generalized to operate on some random-access range other than
> the built-in array, but then I'm thinking, unless some advantage comes
> about, why not giving T[] a little special status? Probably everyone
> thinks of contiguous memory when thinking "buffers", so here
> generalization may be excessive (albeit meaningful).
> 
> Finally, this design is very easy to experiment with and causes no
> disruption to ranges. I can readily add the primitives to byLine and
> byChunk so we can see what streaming we can do that way.
> 
> What do you all think?

Hmm. I think that I'd have to have an actual implementation to mess around with 
to say much. My general take on buffered input is that I don't want to worry 
about it. I want it to be buffered so that it's more efficient, but I don't want to 
have to care about it in how I use it. I would have expected a buffered input 
range to be exactly the same as an input range except that it doesn't really 
just pull in one character behind the scenes. It pulls in 1024 or whatever when 
popFront() would result in the end of the buffer being reached, and you just get 
the first one with front. The API doesn't reflect the fact that it's buffered at 
all except perhaps in how you initialize it (by telling how big the buffer is, 
though generally I don't want to have to care about that either).

Now, there may be some sort of use case where you actually need to care about 
the buffering, so using buffered data in the way that I was thinking wouldn't 
really work. But if so, that's not the sort of use case that I normally run 
into.

Regardless, a more normal range could be built on top of what you're suggesting, 
and it could do essentially what I was thinking buffered ranges would do. So, 
perhaps doing what you're suggesting and building what I was thinking of on top 
of that would be the way to go. That way, if you actually care about messing 
with the buffer, you can, but if not, you just use it normally and the buffering 
is dealt with underneath.

- Jonathan M Davis