stream == range ? [Sliding window]

Sun May 31 10:50:15 PDT 2015

On 31-May-2015 18:58, Andrei Alexandrescu wrote:
> On 5/31/15 8:44 AM, Nick Sabalausky wrote:
>> On 05/30/2015 07:06 AM, short2cave wrote:
>>> Do the classic streams still make sense when we have Ranges ?
>>>
>>
>> I've been thinking the same thing lately. In fact, I'd been meaning to
>> make a post regarding that.
>>
>> Phobos's std.stream has been slated for an overhaul for awhile now, but
>> it seems to me that ranges are already 90% of the way to BEING a total
>> std.stream replacement:
>>
>> Output Streams: AFAICT, 100% covered by output ranges. Output streams
>> exist as a place for sticking arbitrary amounts of sequential data.
>> Output range's "put" does exactly that.
>>
>> Input Streams: Input ranges are very nearly a match for this. AFAICT,
>> The only thing missing here is the ability to "read" not just the one
>> "front" value, but to read the front N values as a chunk, without an
>> O(n) sequence of front/popFront. So we'd just need another "optional"
>> range characteristic: hasFrontN (or some such).
>
> Given that it seems Design by Introspection has been working well for us
> and we're continuing to enhance its use in Phobos, it seems to me that
> optional methods for ranges are the way to go.
>
> An optional method for any range is
>
> size_t bulkRead(T[] target);
>
> which fills as much as possible from target and returns the number of
> items copied.
>
> Another good candidate for optional methods is lookahead.

Hardly. In fact, I've spent quite some time trying to figure this out.

Doing bulk read to some array is the pushing burden of maintaining some 
buffer on the user and adding the overhead of extra copy on buffered 
streams. Not to mention that the more methods we put in range API the 
more if/else forests we produce in our algorithms.

For low-level (as in bytes) range-based I/O to be feasible at minimum 3 
things should be solved:

1. There must be a way to look at underlying buffer w/o copying anything.
Typical use cases in parser:
	- calling memchr or any array-based search without copying elements
	- lookaheads of arbitrary size
	- slicing to get identifiers - we only need to copy if it's new

Anything that does extra copy will put us out of high-perf territory.

2. If we were to reuse algorithms - most algorithms love ForwardRange. 
And there is a problem implementing it for streams in the _range_ API 
itself.

Yeah, most streams are seekable like e.g. files or MM-files, so they 
must be ForwardRanges... And indeed saving position is trivial.

Now look at the signature for ForwardRange:
struct Range{
	...
	@property Range save();
}

Yeah, you'd gotta duplicate the whole stream to count as ForwardRange. 
Yicks! Signatures that might work are:

Mark save();
void restore(Mark m);

3. Extending on the 1&2 above - efficient way to slice data out of 
seekable stream using some saved "marks" on it.

...

Now forget ranges for a second, let's focus on what parsing/tokenizing 
algorithms really need - direct access to buffer, cheap slicing and 
(almost arbitrary) lookahead.

In reality most stream-based parsing (in general - processing) works on 
a form of sliding window abstraction. It gets implemented time and time 
again with different adhoc buffering solutions.

Here is a rough API that works way better then range API for parsing:

strcut SlidingWindow(T){
	// get T's in current window
	T[] window();
	// moves window by n T's over the input
	// shrinks window closers to the end
	size_t skip(size_t n);
	// extend current window to have at least sz T's
	// returns new size
	size_t extend(size_t sz);
}

When window.length == 0 means that stream is exhausted.
Processing moves window by calling skip and along extending it's size.
On most streams skip invalidates previous slices of window, this may be 
a trait.

Generalizing this further we may make window return a sliceable RA-range.

One thing I know for certain it fits perfectly with std.regex and 
backtracking parsers like PEGs.

The only problems left are on the semantic side of working on streams. 
e.g.  regex '.*' would try to extend the window over the whole stream 
somewhat unexpectedly for the most users.

-- 
Dmitry Olshansky