[RFC] I/O and Buffer Range
Dmitry Olshansky
dmitry.olsh at gmail.com
Sun Dec 29 14:02:19 PST 2013
Didn't mean for this to end up as a kind of under-the-tree present... I planned to post it long
ago but barely had the time to flesh it out.
TL;DR: Solve the input side of the I/O problem with a new kind of range.
Specifically, I'm taking on the buffering primitive that sits on top of
low-level unbuffered I/O sources.
Links
Prior related discussions:
http://forum.dlang.org/thread/op.wegg9vyneav7ka@steves-laptop?page=4
http://forum.dlang.org/thread/mailman.17.1317564725.22016.digitalmars-d@puremagic.com?page=1
http://forum.dlang.org/thread/vnpriguleebpbzhkpdwi@forum.dlang.org#post-kh5g46:242prc:241:40digitalmars.com
An early precursor of the proposed design found in DScanner:
https://github.com/Hackerpilot/Dscanner/blob/master/stdx/d/lexer.d#L2388
Intro and motivation
Anyone doing serious work on parsers (text & binary alike) in D soon
finds out that while the slice-em-up approach with random access ranges
works very nicely, any attempt to extend it beyond arrays and in-memory
data gets really awkward. It comes down to implementing no small amount
of bookkeeping (buffering), all the while having to use primitives that
just don't map well to the underlying "hardware" - a buffered stream.
For instance - want to peek 3 bytes ahead (as in LL(3) disambiguation)?
Save the range, do 3 front/popFront calls with empty checks, then copy
the saved range back. Painful, especially when you know full well that
it should be just a length check plus 3 direct reads of an array. Even
if the underlying buffer allows you to slice up the contents and do
array-wise operations on them, the dumbed-down forward range interface
just won't let you.
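To make the contrast concrete, here is roughly what that dance looks
like next to the array version (a sketch only; peek3Range/peek3Array
are made-up names for this illustration):

    import std.range.primitives : isForwardRange, ElementType,
        empty, front, popFront, save;

    // Forward-range way: make an independent copy, pop it 3 times,
    // check empty at every step, then throw the copy away.
    auto peek3Range(R)(R r)
        if (isForwardRange!R)
    {
        ElementType!R[] result;
        auto probe = r.save;
        foreach (_; 0 .. 3)
        {
            if (probe.empty) break;
            result ~= probe.front;
            probe.popFront();
        }
        return result; // r itself is untouched
    }

    // Array way: one length check and a slice.
    ubyte[] peek3Array(ubyte[] a)
    {
        return a[0 .. a.length < 3 ? a.length : 3];
    }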
And once all of the hard (and useless) work to support forward ranges is
done, you find that the next parser has to support forward ranges all
over again. This leads to the conclusion that parsers simply don't want
to work with forward ranges; they need something bigger, and all the
extra work you did on buffering really belongs to that bigger primitive.
To put the last nail in the coffin - most interesting sources can't
even be forward ranges! A stream over a TCP socket - how do you 'save'
that, Sherlock? Keeping in mind that it has to return the same type -
we'd better have called it "fork" back then.
To sum it up - the current selection of ranges is not good enough to
work with I/O _efficiently_. Yet ranges are very neat, and we'd better
retain their strengths and capitalize on the existing work done in this
area (e.g. the slew of good stuff in std.algorithm and std.range). So
we've got to extend the notion.
Goals
Introduce a better primitive aimed squarely at solving the _input_ side
of the I/O problem. Output IMHO might need some tweaks, but with
OutputRange it's in much better shape than input.
The said new input primitive (from now on "buffer range") must naturally
and efficiently support the following:
1) The zero-copy, insanely fast & popular slice-em-up approach for in-memory use.
2) "Over the wire" sources such as pipes, sockets and the like
3) Memory mapped files esp. the ones that don't fit into memory
4) Primitives that (all kinds of) parsers need, including lookahead and
lookbehind.
It would be awesome if we could also:
5) Retain backwards compatibility and integrate well with existing ranges.
6) Avoid dependency on the C run-time and surpass it performance-wise.
7) Remove extra burden and decouple dependencies in the chain:
input-source <--> buffer range <--> parser/consumer
Meaning that if we can mix and match parsers with buffer ranges, and
buffer ranges with input sources, we will have grown something powerful indeed.
Spoiler
I have a proof of concept implemented. It *works*. What I'm looking for
is ways to simplify the set of primitives and/or better suggestions.
Code: https://github.com/blackwhale/datapicked/tree/master/dpick/buffer
Docs: http://blackwhale.github.io/datapicked/
See it in action with e.g. a slightly trimmed-down std.regex using this
abstraction and working directly over files:
grep-like tool
https://github.com/blackwhale/datapicked/blob/master/dgrep.d
and the module itself
https://github.com/blackwhale/datapicked/blob/master/regex.d
(for usage see build.sh and bench.sh)
It's as efficient as it was with arrays, but there is no more need to
work line by line and/or load the whole file.
In fact it's faster than the fastest line-by-line solution I had before:
fgets + std.regex.matchFirst. Note that this (old) solution is cheating
twice - it seeks only one match per line and it knows the line length
ahead of time.
For me this proves that (6) is well within our reach.
Proposal
The BufferRange concept itself (for now called simply Buffer) is defined
here:
http://blackwhale.github.io/datapicked/dpick.buffer.traits.html
I'm not comfortable with the sheer number of primitives but my goal was
sufficient first, minimal second.
Rationale for the selection of the primitives follows.
1. Start with InputRange, as there is no need to break a good thing
(and foreach). This gets us somewhat close to (5). :)
Accept that, given requirements (1)-(3), we are working with a "sliding
window" over whatever the true source of data is. Thus a sliding window
can be a whole array, a mapped area of a file or the buffer of a network
stream. A sliding window may be moved across the input stream (~
re-loading the buffer) or extended to reach further.
Now what we need is to properly exploit the capabilities of the sliding
window model and match them with the primitives a parser would "like".
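As a rough illustration only (the authoritative definitions live in
dpick.buffer.traits, linked above), a compile-time check for such a
primitive could be spelled along these lines; the exact set and names of
primitives are precisely what's up for discussion:

    import std.range.primitives : isInputRange, ElementType;

    // Illustrative approximation, not the real trait: an input range of
    // ubyte whose sliding window additionally supports pinning (mark),
    // slicing from a pin, and seeking back to it.
    enum isBufferRange(R) = isInputRange!R
        && is(ElementType!R : ubyte)
        && is(typeof((R r) {
            auto m = r.mark();   // pin the current position
            auto s = r.slice(m); // data from the pin up to the current position
            r.seek(m);           // go back to the pinned position
        }));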
2. Slicing "after the fact".
This means the ability to mark a relevant spot in the buffer as the
start of something "useful", and to require the underlying
implementation, when the time comes to move the window, to keep the data
starting from that point.
Typically, later "down the stream", when the boundaries of the slice are
established, it's extracted, examined and (rarely!) copied over. The
ability to avoid a copy unless absolutely necessary is _crucial_.
Motivating (pseudo-)code, of the kind I've seen in lexers (e.g.
DScanner), looks like this with my primitives:
{
    //pin this position so that buffering won't lose it
    auto m = input.mark();
    while(!input.empty && isAlpha(input.front)){
        input.popFront();
    }
    //get a slice from 'm' to current position
    auto word = input.slice(m);
    //grab a copy if haven't seen before
    if(word !in identifiers)
        identifiers.insert(word.idup);
} //here m's destructor unpins position in input
To address slicing (a parser requirement) I had to introduce the marking
and pinning concept explicitly. See mark/slice in the docs.
3. Cheap save/restore.
Copying the whole range to save the iteration state was a bad idea. It's
not just wasteful memory-wise, it's _semantically_ costly. In the case
of a buffer range it would also imply some "re-reading" of the I/O
source, as the copy has to be a completely independent view of the same
input(!). Hence the forward range's "save" is an inadequate primitive
that must be dropped.
However, when the time comes to backtrack (and many parsers do that,
quite often) the state has to be restored. To minimize the primitive
count and not break requirement (2), reuse 'mark' to save the state and
add 'seek' to restore the position to the marked point. As it was
pinned, it's always in the buffer and readily accessible, keeping the
ability to work over streams.
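For instance, a backtracking helper in this spirit could look like the
following (a sketch only; 'Buf' stands for any buffer range, and the
exact signatures of mark/seek are as I read them from the proposal):

    // Try to match a literal sequence; on failure rewind to where we started.
    bool tryMatch(Buf)(ref Buf input, const(ubyte)[] pattern)
    {
        auto m = input.mark();          // pin: this position stays in the window
        foreach (b; pattern)
        {
            if (input.empty || input.front != b)
            {
                input.seek(m);          // backtrack to the pinned position
                return false;
            }
            input.popFront();
        }
        return true;                    // matched; input is now past the pattern
    }   // 'm' goes out of scope here and unpins the position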
4. Even cheaper save/restore.
Something discovered the hard way in a practical setting is that setting
up the boundaries of stuff to slice (captures in regex) must be dirt
cheap - integer assignment cheap.
This means that always using marking for every sub-slice is
prohibitively costly, as it has to communicate with the buffer
(currently bumping a counter in some array). The idea that worked out
with std.regex is to use a single "vantage point" mark, take a big slice
off that, and then make sub-slices of it as the plain array it is. Then,
from time to time, the "vantage point" should be updated when there are
no sub-slices in mid-air.
This leads to a 'tell' primitive that gives the offset from a given mark
to the current position, plus 'seek' overloads that work with relative
offsets (positive and negative).
Another use for relative seek is skipping over data without looking at
it, potentially dropping whole buffers away (there is an overlap with
popFrontN though). Also, fixed-width backtracking is easily had with
relative seek if we can instruct the buffer range to keep at least the
last K bytes in the buffer (require a minimal history).
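To illustrate the vantage-point idea (a sketch under my reading of the
proposal; the 'tell', 'seek' and 'slice' signatures are assumed from the
description above, and Capture is a made-up struct):

    struct Capture { size_t begin, end; }  // offsets relative to the vantage mark

    void scanOneGroup(Buf)(ref Buf input)
    {
        auto vantage = input.mark();       // the single pinned position
        Capture cap;
        cap.begin = input.tell(vantage);   // just an integer - no buffer traffic
        // ... advance with front/popFront (or relative seek) while matching ...
        cap.end = input.tell(vantage);
        // materialize the capture as a plain array only when actually needed:
        auto bytes = input.slice(vantage)[cap.begin .. cap.end];
        // once no sub-slices are in mid-air, the vantage point can be re-marked
        // closer to the current position so the window is free to slide on.
    }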
5. Cheap lookahead/lookbehind.
This exploits the fact that the underlying buffer is nothing but an
array, hence one may easily take a peek at some portion of it as a plain
ubyte[]. The implementation makes sure it buffers up enough data as
required, and/or returns an empty range if that's not possible. This
supports things like LL(k) lookahead and the fixed-length lookbehind
that is common in regex zero-width assertions.
These two could also be implemented with relative 'seek' +
front/popFront; the question that remains is how effective that would be.
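For example, the LL(3) case from the intro becomes a length check plus
three reads again (a sketch assuming a lookahead(n) primitive that
returns up to n bytes as a plain slice, empty if the source can't
provide them; see the linked docs for the actual interface):

    // Peek at the next 3 bytes without consuming them, e.g. to detect a UTF-8 BOM.
    bool startsWithBOM(Buf)(ref Buf input)
    {
        auto ahead = input.lookahead(3);   // ubyte[] view into the buffer, no copy
        if (ahead.length < 3)
            return false;                  // not enough data buffered / end of input
        return ahead[0] == 0xEF && ahead[1] == 0xBB && ahead[2] == 0xBF;
    }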
--
Dmitry Olshansky