[RFC] I/O and Buffer Range

Dmitry Olshansky dmitry.olsh at gmail.com
Sun Dec 29 14:02:19 PST 2013


Didn't mean this to be kind of an under-the-tree present... I planned to
post it long ago but barely had the time to flesh it out.

TL;DR: Solve the input side of the I/O problem with a new kind of range.
Specifically, I'm taking on the buffering primitive on top of the
low-level unbuffered I/O sources.

Links

Prior related discussions:
http://forum.dlang.org/thread/op.wegg9vyneav7ka@steves-laptop?page=4

http://forum.dlang.org/thread/mailman.17.1317564725.22016.digitalmars-d@puremagic.com?page=1

http://forum.dlang.org/thread/vnpriguleebpbzhkpdwi@forum.dlang.org#post-kh5g46:242prc:241:40digitalmars.com

An early precursor of the proposed design found in DScanner:
https://github.com/Hackerpilot/Dscanner/blob/master/stdx/d/lexer.d#L2388

Intro and motivation

Anyone doing serious work on parsers (text & binary alike) with D soon
finds out that while the slice-em-up approach with random access ranges
works very nicely, any attempt to extend it beyond arrays and in-memory
stuff gets really awkward. It amounts to implementing no small amount of
bookkeeping (buffering), all while having to use primitives that just
don't map well to the underlying "hardware" - a buffered stream.

For instance - want to peek 3 bytes ahead (as in LL(3) disambiguation)?
Save the range, do 3 front/popFront calls with empty checks, then copy
the saved range back. Painful, especially when you know full well that it
should be just a length check plus 3 direct reads of an array. Even if
the underlying buffer allows you to slice up the contents and do
array-wise operations on them, the dumbed-down forward range interface
just won't let you.
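
Just to make the contrast concrete, here is a minimal sketch of both
sides of this (standard ranges only; the helper names are made up):

import std.range;

//peeking 3 elements ahead with nothing but the forward range API:
//copy the whole iteration state, pop it 3 times with empty checks, discard it
bool startsWithAbc(R)(R input)
    if (isForwardRange!R && is(ElementType!R : dchar))
{
    auto probe = input.save;
    foreach (ch; "abc")
    {
        if (probe.empty || probe.front != ch)
            return false;
        probe.popFront();
    }
    return true;
}

//the same check over an array: a length test plus 3 direct reads
bool startsWithAbc(const(char)[] input)
{
    return input.length >= 3 && input[0 .. 3] == "abc";
}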

And once all of the hard (and useless) work to support forward ranges is
done, you realize you have a 2nd parser that should support forward
ranges as well. This leads to the conclusion that parsers simply don't
want to work on forward ranges; they need something bigger, and all the
extra work you did on buffering really belongs to that bigger primitive.

To put the last nail in the coffin - most of the interesting sources
can't even be forward ranges! A stream over a TCP socket - how do you
'save' that, Sherlock? Keeping in mind that it has to return the same
type, we'd have done better to call it "fork" back then.

To sum it up - the current selection of ranges is not good enough to
work _efficiently_ with I/O. Yet ranges are very neat, and we'd better
retain their strengths and capitalize on the existing work done in this
area (e.g. the slew of good stuff in std.algorithm and std.range). So we
have got to extend the notion.

Goals

Introduce a better primitive aimed squarely at solving the _input_ side
of the I/O problem. Output IMHO might need some tweaks, but with
OutputRange it's in a much better shape than input.

The said new input primitive (from now on "buffer range") must naturally
and efficiently support the following:
1) The zero-copy, insanely fast & popular slice-em-up approach for
in-memory use.
2) "Over the wire" sources such as pipes, sockets and the like.
3) Memory-mapped files, esp. ones that don't fit into memory.
4) The primitives that (all kinds of) parsers need, including lookahead
and lookbehind.

It would be awesome if we could also:
5) Retain backwards compatibility and integrate well with existing ranges.
6) Avoid dependency on the C run-time and surpass it performance-wise.
7) Remove extra burden and decouple dependencies in the chain:

input-source <--> buffer range <--> parser/consumer

Meaning that if we can mix and match parsers with buffer ranges, and
buffer ranges with input sources, we will have grown something powerful
indeed.
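
To give a rough idea of that decoupling, here is a hypothetical consumer
written once against the buffer range concept. I'm assuming a trait along
the lines of isBuffer from the docs linked below; the factory names in
the comments are purely illustrative:

//counts newlines no matter where the bytes come from
size_t countLines(Buffer)(ref Buffer buf)
    if (isBuffer!Buffer)
{
    size_t lines;
    for (; !buf.empty; buf.popFront())
        if (buf.front == '\n')
            ++lines;
    return lines;
}

//mix and match sources without touching the consumer:
//  auto a = buffer(cast(ubyte[])"one\ntwo\n");  // in-memory array
//  auto f = fileBuffer("huge.log");             // streaming file source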

Spoiler

I have a proof of concept implemented. It *works*. What I'm looking for
is to simplify the set of primitives and/or get better suggestions.

Code: https://github.com/blackwhale/datapicked/tree/master/dpick/buffer
Docs: http://blackwhale.github.io/datapicked/

See it in action with e.g. a slightly trimmed-down std.regex using this
abstraction and working directly over files:

grep-like tool
https://github.com/blackwhale/datapicked/blob/master/dgrep.d

and the module itself
https://github.com/blackwhale/datapicked/blob/master/regex.d

(for usage see build.sh and bench.sh)

It's as efficient as it was with arrays, but there is no more need to
work line by line and/or load the whole file.

In fact it's faster than the fastest line-by-line solution I had before:
fgets + std.regex.matchFirst. Note that this (old) solution is cheating
twice - it seeks only 1 match per line and it knows the line length
ahead of time.

For me this proves that (6) is well within our reach.

Proposal

The BufferRange concept itself (for now called simply Buffer) is defined 
here:
http://blackwhale.github.io/datapicked/dpick.buffer.traits.html

I'm not comfortable with the sheer number of primitives, but my goal was
sufficient first, minimal second.

Rationale for the selection of the primitives follows.

1. Start with InputRange, as there is no need to break a good thing
(and foreach). This gets us somewhat close to (5). :)

Accept that, given requirements (1)-(3), we are working with a "sliding
window" over whatever the true source of data is. Such a sliding window
can be the whole array, a mapped area of a file or the buffer of a
network stream. A sliding window may be moved across the input stream
(~ re-loading the buffer) or extended to reach further.

Now what we need is to properly exploit the capabilities of the sliding
window model and match them with the primitives a parser would "like".

2. Slicing "after the fact".

This means the ability to mark relevant parts of the buffer as the start
of something "useful" and to require the underlying implementation, when
the time comes to move the window, to keep the data starting from there.
Typically, later "down the stream" when the boundaries of the slice are
established, it is extracted, examined and (rarely!) copied over. The
ability to avoid a copy unless absolutely necessary is _crucial_.

Motivating (pseudo-)code, as seen in lexers (e.g. DScanner), looks like
this with my primitives:

{
   //pin this position so that buffering won't lose it
   auto m = input.mark();
   while(!input.empty && isAlpha(input.front)){
      input.popFront();
   }
   //get a slice from 'm' to current position
   auto word = input.slice(m);
   //grab a copy if haven't seen before
   if(word !in identifiers)
       identifiers.insert(word.idup);
} //here m's destructor unpins position in input

To address slicing (a parser requirement) I had to introduce the marking
and pinning concepts explicitly. See mark/slice in the docs.

3. Cheap save/restore.

Copying the whole range to save the iteration state was a bad idea. It's
not just wasteful memory-wise, it's _semantically_ costly. In the case of
a buffer range it would also imply some "re-reading" of the I/O source,
as the copy has to be a completely independent view of the same input(!).
Hence the forward range's "save" is an inadequate primitive that must be
dropped.

However, when the time comes to backtrack (and many parsers do that,
quite often) the state has to be restored. To minimize the primitive
count and not break requirement (2), reuse 'mark' to save the state and
add 'seek' to restore the position to the marked point. As it was pinned,
it's always in the buffer and readily accessible, keeping the ability to
work over streams.
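
As a sketch of how backtracking looks with just these two primitives
(treat the exact signatures as illustrative; the real ones are in the
docs):

//try to match a literal word; on failure rewind to where we started
bool tryMatch(Buffer)(ref Buffer input, const(char)[] word)
{
    auto m = input.mark();       //pin the current position
    foreach (ch; word)
    {
        if (input.empty || input.front != ch)
        {
            input.seek(m);       //restore: the pinned data is still in the window
            return false;
        }
        input.popFront();
    }
    return true;
}                                //m's destructor unpins the position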

4. Even cheaper save/restore.

Something discovered the hard way in a practical setting is that setting
up the boundaries of the stuff to slice (captures in regex) must be dirt
cheap, like integer-assignment cheap.

This means that always using marking for every sub-slice is
prohibitively costly, as it has to communicate with the buffer (currently
bumping a counter in some array). The idea that worked out with std.regex
is to use a single "vantage point" mark, take one big slice off of it,
and then make sub-slices of that as the plain array it is. Then, from
time to time, the "vantage point" is updated when there are no sub-slices
in mid-air.

This leads to a 'tell' primitive that gives the offset from a given mark
to the current position, plus 'seek' overloads that work with relative
offsets (positive and negative).
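
A sketch of the vantage point trick as it is used in the std.regex port
(the names and the capture struct are mine; only the offset arithmetic
matters):

//capture boundaries are stored as plain integers relative to one pinned mark
struct Capture { size_t begin, end; }

const(ubyte)[] captureWord(Buffer)(ref Buffer input)
{
    auto vantage = input.mark();        //the single pinned position
    Capture cap;
    cap.begin = input.tell(vantage);    //integer-assignment cheap
    while (!input.empty && input.front != ' ')
        input.popFront();
    cap.end = input.tell(vantage);
    //one big slice off the vantage point, then sub-slice it as a plain array
    auto window = input.slice(vantage);
    return window[cap.begin .. cap.end];
}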

Another use for relative seek is skipping over data without looking at
it, potentially dropping whole buffers along the way (there is some
overlap with popFrontN though). Also, fixed-width backtracking is easily
had with relative seek if we can instruct the buffer range to keep at
least the last K bytes in the buffer (require a minimal history).
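
For the skipping case, a tiny hypothetical record reader (it assumes a
4-byte little-endian length prefix; all names are illustrative):

//skip a length-prefixed record without ever looking at its payload
void skipRecord(Buffer)(ref Buffer input)
{
    uint len;
    foreach (i; 0 .. 4)
    {
        assert(!input.empty, "truncated record header");
        len |= (cast(uint) input.front) << (8 * i);
        input.popFront();
    }
    input.seek(len);    //relative seek forward: may drop whole buffers on the way
}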

5. Cheap lookahead/lookbehind.

This exploits the fact that the underlying buffer is nothing but an
array, hence one may easily take a peek at some portion of it as a plain
ubyte[]. The implementation makes sure it buffers up enough data as
required and/or returns an empty range if that's not possible. This
supports things like LL(k) lookahead and the fixed-length lookbehind that
is common in regex zero-width assertions.

These two can be implemented with a relative 'seek' plus front/popFront;
the question that remains is how effective that would be.
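
Roughly what LL(k)-style disambiguation could look like on top of that,
assuming a lookahead(n) primitive that returns up to n bytes as a plain
ubyte[] slice, or an empty one if that much data can't be buffered up
(the exact name may differ from the docs):

import std.ascii : isAlphaNum;

//is the next token the keyword "int" rather than a longer identifier
//such as "interface"?
bool isIntKeyword(Buffer)(ref Buffer input)
{
    auto peek = input.lookahead(4);     //plain view into the buffer, no copying
    if (peek.length == 4)
        return peek[0 .. 3] == cast(const(ubyte)[]) "int"
            && !isAlphaNum(cast(char) peek[3]);
    //fewer than 4 bytes left in the input: "int" can still end the stream
    peek = input.lookahead(3);
    return peek == cast(const(ubyte)[]) "int";
}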

-- 
Dmitry Olshansky

