[RFC] I/O and Buffer Range
Dmitry Olshansky
dmitry.olsh at gmail.com
Mon Dec 30 00:06:28 PST 2013
30-Dec-2013 02:45, Vladimir Panteleev wrote:
> On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky wrote:
>> [snip]
>
> Hmm, just yesterday I was rewriting a parser to use a buffer instead of
> loading the whole file in memory, so this is quite timely for me.
>
> Questions:
>
> 1. What happens when the distance between the pinned and current
> position exceeds the size of the buffer (sliding window)? Is the buffer
> size increased, or is the stream rewound if possible and the range
> returned by the slice does seeking?
The window is expected to grow. The exact implementation may play any
dirty tricks it sees fit, as long as it can provide a slice over the
pinned area. In short: maintain the illusion that the window has grown.
I would be against a seeking range and would most likely opt for
memory-mapped files instead, but it all depends on the exact numbers.
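
A minimal sketch of the "window grows while something is pinned" idea,
not the actual implementation. PinnedWindow and all of its members are
hypothetical names made up for illustration, and fresh input is simply
handed in as chunks:

```d
// Hypothetical illustration only; not the API from the RFC.
struct PinnedWindow
{
    const(ubyte)[] window;        // bytes currently buffered
    size_t pos;                   // current position within window
    size_t pinned = size_t.max;   // start of pinned area, size_t.max if none

    void pin()   { if (pinned == size_t.max) pinned = pos; }
    void unpin() { pinned = size_t.max; }

    // Append a freshly read chunk. Bytes before both the pin and the
    // current position are dropped; everything else is kept, so the
    // window grows for as long as the pin is held.
    void append(const(ubyte)[] chunk)
    {
        auto keepFrom = pinned < pos ? pinned : pos;
        window = window[keepFrom .. $] ~ chunk;
        pos -= keepFrom;
        if (pinned != size_t.max) pinned -= keepFrom;
    }

    // Slice over the pinned area up to the current position.
    // Only meaningful while something is pinned.
    const(ubyte)[] pinnedSlice() const { return window[pinned .. pos]; }
}
```

A real buffer could reuse storage or switch to a memory-mapped file
behind the same interface; the only promise shown here is that the
pinned slice stays valid.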
>
> 2. I don't understand the rationale behind the current semantics of
> lookahead/lookbehind. If you want to e.g. peek ahead/behind to find the
> first whitespace char, you don't know how many chars to request.
If you want to 'find' just do front/popFront, no?
Or do you specifically want to do array-wise operations?
> Wouldn't it be better to make these functions return the ENTIRE
> available buffer in O(1)?
Indeed, now I think that 2 overloads would be better:
auto lookahead(size_t n); //exactly n bytes, re-buffering as needed
auto lookahead(); // all that is available in the window, no re-buffering
Similar for lookbehind.
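
A rough sketch of how the two overloads could differ. WindowBuffer and
its refill delegate are hypothetical names, and returning a shorter
slice at end of input is an assumption of this sketch rather than
something the proposal specifies:

```d
// Hypothetical illustration only; not the API from the RFC.
struct WindowBuffer
{
    ubyte[] window;                    // bytes currently buffered
    size_t pos;                        // current position within window
    ubyte[] delegate(size_t) refill;   // assumed source: up to n more bytes,
                                       // empty slice at end of input

    // All that is available in the window, no re-buffering: an O(1) slice.
    const(ubyte)[] lookahead()
    {
        return window[pos .. $];
    }

    // n bytes, re-buffering as needed (shorter only at end of input).
    const(ubyte)[] lookahead(size_t n)
    {
        while (window.length - pos < n)
        {
            auto more = refill(n - (window.length - pos));
            if (more.length == 0) break;   // end of input
            window ~= more;
        }
        auto stop = pos + n < window.length ? pos + n : window.length;
        return window[pos .. stop];
    }
}
```

The "cheap first, expensive second" pattern then becomes: scan
lookahead() and fall back to lookahead(n) only if the scan runs out of
data.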
> I guess I see the point when applied to regular expressions, where the
> user explicitly specifies how many characters to look ahead/behind.
Actually the user doesn't: our lookahead/lookbehind is variable length.
One thing I would have to drop is unbounded lookbehind, not that it's
all that critical.
> However, I think in most use cases the amount is not known beforehand
> (without imposing arbitrary limitations on users like "Thou shalt not
> have variable identifiers longer than 32 characters"), so the pattern
> would be "try a cheap lookahead/behind, and if that fails, do an
> expensive one".
I would say that in the case where you need arbitrary-length lookahead:
m = mark, seek + popFront x N, seek(m) should fit the bill.
Or, as is the case in regex at the moment, mark once and seek back to
some position relative to the mark. In a word: backtracking.
An example of where fixed lookahead rocks:
https://github.com/blackwhale/datapicked/blob/master/dpick/buffer/buffer.d#L421
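
The mark/popFront/seek pattern for arbitrary-length lookahead could be
sketched as below, using the "find the first whitespace char" case from
the question. ArrayBuffer is a hypothetical in-memory stand-in, and its
mark here is a plain offset, which is an assumption of the sketch:

```d
// Hypothetical stand-in for a buffer range; not the API from the RFC.
struct ArrayBuffer
{
    const(ubyte)[] data;
    size_t pos;

    @property bool empty() const { return pos >= data.length; }
    @property ubyte front() const { return data[pos]; }
    void popFront() { ++pos; }
    size_t mark() const { return pos; }   // remember the current position
    void seek(size_t m) { pos = m; }      // go back to a marked position
}

// Number of bytes from the current position to the first whitespace
// byte (or to end of input). The buffer position is restored on return.
size_t distanceToWhitespace(Buf)(ref Buf buf)
{
    auto m = buf.mark();
    size_t n = 0;
    while (!buf.empty && buf.front != ' ' && buf.front != '\t'
           && buf.front != '\n')
    {
        buf.popFront();
        ++n;
    }
    buf.seek(m);   // backtrack to where we started
    return n;
}
```

No length has to be guessed up front: the scan simply walks forward and
then rewinds.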
>
> 3. I think ideally the final design would use something like what
> std.allocator does with "unbounded" and "chooseAtRuntime" - some uses
> might not need lookahead or lookbehind or other features at all, so
> having a way to disable the relevant code would benefit those cases.
It makes sense to make lookahead and lookbehind optional.
As for the code - for the moment it doesn't add much and builds on stuff
already there. Though I suspect some other implementations would be able
to "cut corners" more efficiently.
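
One possible way to make a feature optional at compile time, sketched
under the assumption of a simple bool template parameter. SliceBuffer
and hasLookbehind are hypothetical names, not part of the proposal or
of std.allocator:

```d
// Hypothetical illustration only; not the API from the RFC.
struct SliceBuffer(bool withLookbehind)
{
    const(ubyte)[] data;
    size_t pos;

    @property bool empty() const { return pos >= data.length; }
    @property ubyte front() const { return data[pos]; }
    void popFront() { ++pos; }

    static if (withLookbehind)
    {
        // Only compiled in when requested: everything already consumed.
        const(ubyte)[] lookbehind() const { return data[0 .. pos]; }
    }
}

// Lets generic code check for the optional primitive.
enum hasLookbehind(T) = __traits(hasMember, T, "lookbehind");
```

Generic code can then branch with static if (hasLookbehind!Buf) and
take a slower fallback path when the primitive is absent.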
--
Dmitry Olshansky