buffered input

Fri Feb 4 22:41:39 PST 2011

Interesting.  I was just writing LazyMap and AsyncBuf in 
std.parallelism, and I ran into these exact issues.  (A LazyMap computes 
a map lazily and in parallel and stores the result in a buffer.  An 
AsyncBuf reads from an unbuffered input range in a background thread and 
buffers the results for when you need them.  I wanted to optimize 
chaining LazyMaps and AsyncBufs for pipelining parallelism.)

I solved them with very ad-hoc way with lots of static if statements and 
encapsulation violation within the module.  One thing that would make a 
more principled approach in std.parallelism possible is a swapBuffer() 
primitive.  You provide the range with a new buffer to fill, and it 
gives you complete ownership of the old buffer.  This is basically how 
LazyMap/AsyncBuf pipelining works under the hood, though I never gave 
any serious consideration to the more general case.

On 2/5/2011 12:46 AM, Andrei Alexandrescu wrote:
> I've had the opportunity today to put some solid hours of thinking into
> the relationship (better said the relatedness) of what would be called
> "buffered streams" and ranges. They have some commonalities and some
> differences, but it's been difficult to capture them. I think now I have
> a clear view, caused by a few recent discussions. One was the CSV reader
> discussed on the Phobos list; another was the discussion on defining the
> "right" std.xml.
>
> First, let's start with the humblest abstraction of all - an input
> range, which only defines the troika empty/front/popFront with the known
> semantics.
>
> An input range consumes input destructively and has a one-element
> horizon. It may as well considered a buffered stream with the buffer
> length exactly one.
>
> Going from there, we may say that certain streaming can be done by using
> an input range of ubyte (or dchar for text). That would be the
> UTFpowered equivalent of getchar(). The readf function operates that way
> - it only needs to look one character ahead. Incidentally, the CSV
> format also requires lookahead of 1, so it also can operate on a range
> of dchar.
>
> At this point we need to ask ourselves an essential question. Since we
> have this "input range" abstraction for a 1-element buffer, what would
> its n-elements buffer representation look like? How do we go from "input
> range of T" (which really is "unbuffered input range of T" to "buffered
> input range of T"?
>
> Honestly, the answer was extremely unclear to me for the longest time. I
> thought that such a range would be an extension of the unbuffered one,
> e.g. a range that still offers T from front() but also offers some
> additional functions - e.g. a lookahead in the form of a random-access
> operator. I still think something can be defined along those lines, but
> today I came together with a design that is considerably simpler both
> for the client and the designer of the range.
>
> I hereby suggest we define "buffered input range of T" any range R that
> satisfies the following conditions:
>
> 1. R is an input range of T[]
>
> 2. R defines a primitive shiftFront(size_t n). The semantics of the
> primitive is that, if r.front.length >= n, then shiftFront(n) discards
> the first n elements in r.front. Subsequently r.front will return a
> slice of the remaining elements.
>
> 3. R defines a primitive appendToFront(size_t n). Semantics: adds at
> most n more elements from the underlying stream and makes them available
> in addition to whatever was in front. For example if r.front.length was
> 1024, after the call r.appendToFront(512) will have r.front have length
> 1536 of which the first 1024 will be the old front and the rest will be
> newly-read elements (assuming that the stream had enough data). If n =
> 0, this instructs the stream to add any number of elements at its own
> discretion.
>
> This is it. I like many things about this design, although I still fear
> some fatal flaw may be found with it.
>
> With these primitives a lot of good operating operating on buffered
> streams can be written efficiently. The range is allowed to reuse data
> in its buffers (unless that would contradict language invariants, e.g.
> if T is invariant), so if client code wants to stash away parts of the
> input, it needs to make a copy.
>
> One great thing is that buffered ranges as defined above play very well
> with both ranges and built-in arrays - two quintessential parts of D. I
> look at this and say, "this all makes sense". For example the design
> could be generalized to operate on some random-access range other than
> the built-in array, but then I'm thinking, unless some advantage comes
> about, why not giving T[] a little special status? Probably everyone
> thinks of contiguous memory when thinking "buffers", so here
> generalization may be excessive (albeit meaningful).
>
> Finally, this design is very easy to experiment with and causes no
> disruption to ranges. I can readily add the primitives to byLine and
> byChunk so we can see what streaming we can do that way.
>
> What do you all think?
>
>
> Andrei