[RFC] I/O and Buffer Range

Dmitry Olshansky dmitry.olsh at gmail.com
Mon Dec 30 00:06:28 PST 2013


30-Dec-2013 02:45, Vladimir Panteleev wrote:
> On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky wrote:
>> [snip]
>
> Hmm, just yesterday I was rewriting a parser to use a buffer instead of
> loading the whole file in memory, so this is quite timely for me.
>
> Questions:
>
> 1. What happens when the distance between the pinned and current
> position exceeds the size of the buffer (sliding window)? Is the buffer
> size increased, or is the stream rewound if possible and the range
> returned by the slice does seeking?

The window is expected to grow. The exact implementation may play any
dirty tricks it sees fit as long as it can provide a slice over the
pinned area. In short: maintain the illusion that the window has grown.
I would be against a seeking range and would most likely opt for
memory-mapped files instead, but it all depends on the exact numbers.
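
To make the "window grows while pinned" behaviour concrete, here is a minimal sketch. It is a toy, not the proposed implementation: PinnedWindow, its fields, the chunk size of 4 and the in-memory `source` standing in for a stream are all assumptions made for illustration. Only the idea of pinning and slicing comes from the discussion.

```d
import std.algorithm : min;
import std.string : representation;

// Hypothetical toy buffer; all names here are illustrative assumptions.
struct PinnedWindow
{
    const(ubyte)[] source;       // stand-in for the underlying stream
    ubyte[] window;              // bytes currently buffered
    size_t pos;                  // cursor, as an index into window
    size_t pinned = size_t.max;  // index of the pin; size_t.max means "no pin"
    size_t chunk = 4;            // bytes loaded per refill (assumed)

    void pin() { pinned = pos; }

    // Loads the next chunk. Bytes before the pin (or before the cursor
    // when nothing is pinned) are dropped, so an active pin forces the
    // window to grow instead of slide.
    void refill()
    {
        auto keepFrom = min(pinned, pos);
        window = window[keepFrom .. $];
        pos -= keepFrom;
        if (pinned != size_t.max) pinned -= keepFrom;
        auto n = min(chunk, source.length);
        window ~= source[0 .. n].dup;
        source = source[n .. $];
    }

    void popFront()
    {
        ++pos;
        if (pos == window.length) refill();
    }

    // Slice over the pinned area, from the pin up to the cursor.
    const(ubyte)[] slice() { return window[pinned .. pos]; }
}

void main()
{
    auto w = PinnedWindow("abcdefghij".representation);
    w.refill();                 // initial load
    w.popFront();
    w.popFront();
    w.pin();                    // pin at 'c'
    foreach (i; 0 .. 6) w.popFront();
    assert(w.slice() == "cdefgh".representation);
    assert(w.window.length > w.chunk);   // window grew past one chunk
}
```

Without the pin() call the same loop keeps the window at one chunk or less, since everything before the cursor is discarded on each refill.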

>
> 2. I don't understand the rationale behind the current semantics of
> lookahead/lookbehind. If you want to e.g. peek ahead/behind to find the
> first whitespace char, you don't know how many chars to request.

If you want to 'find' just do front/popFront, no?
Or do you specifically want to do array-wise operations?

> Wouldn't it be better to make these functions return the ENTIRE
> available buffer in O(1)?

Indeed, now I think that 2 overloads would be better:
auto lookahead(size_t n); //exactly n bytes, re-buffering as needed
auto lookahead(); // all that is available in the window, no re-buffering

Similar for lookbehind.
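
A rough sketch of how the two overloads could behave, assuming a toy buffer over an in-memory source. ToyBuffer, refill, the chunk size and the choice to return null at end of stream are all made up for this example; only the two lookahead signatures come from the proposal above.

```d
import std.algorithm : min;
import std.string : representation;

// Hypothetical toy buffer; `source` stands in for a file or socket.
struct ToyBuffer
{
    const(ubyte)[] source;   // data not yet loaded
    ubyte[] window;          // data currently buffered
    size_t pos;              // cursor, as an index into window
    size_t chunk = 4;        // bytes loaded per refill (assumed)

    // Loads one more chunk; returns false at end of stream.
    bool refill()
    {
        if (source.length == 0) return false;
        auto n = min(chunk, source.length);
        window ~= source[0 .. n].dup;
        source = source[n .. $];
        return true;
    }

    // Exactly n bytes ahead of the cursor, re-buffering as needed.
    // Returns null if the stream ends first (an assumption of this sketch).
    const(ubyte)[] lookahead(size_t n)
    {
        while (window.length - pos < n)
            if (!refill()) return null;
        return window[pos .. pos + n];
    }

    // Everything already buffered ahead of the cursor; never re-buffers.
    const(ubyte)[] lookahead()
    {
        return window[pos .. $];
    }
}

void main()
{
    auto buf = ToyBuffer("hello world".representation);
    assert(buf.lookahead().length == 0);              // nothing buffered yet
    assert(buf.lookahead(6) == "hello ".representation);
    assert(buf.lookahead().length == 8);              // two chunks were loaded
    assert(buf.lookahead(100) is null);               // stream is too short
}
```

The no-argument overload is O(1) and suits the "try a cheap lookahead first" pattern; the sized overload is the one that may touch the underlying stream.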

> I guess I see the point when applied to regular expressions, where the
> user explicitly specifies how many characters to look ahead/behind.

Actually the user doesn't - our lookahead/lookbehind is variable length.
One thing I would have to drop is unbounded lookbehind, not that it's so
critical.

> However, I think in most use cases the amount is not known beforehand
> (without imposing arbitrary limitations on users like "Thou shalt not
> have variable identifiers longer than 32 characters"), so the pattern
> would be "try a cheap lookahead/behind, and if that fails, do an
> expensive one".

I would say that in the case where you need arbitrary-length lookahead:
m = mark, seek + popFront x N, seek(m) should fit the bill.
Or, as is the case in regex at the moment, mark once and seek back
to some position relative to it. In one word - backtracking.
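
The mark / popFront x N / seek(m) pattern can be sketched like this. ToyCursor keeps the whole input in memory and uses a plain index as the mark, both assumptions made to keep the example short; a real buffer would also have to keep the marked data alive.

```d
// Hypothetical in-memory cursor; the mark is simply a saved index.
struct ToyCursor
{
    string data;
    size_t pos;

    bool empty() { return pos == data.length; }
    char front() { return data[pos]; }
    void popFront() { ++pos; }

    size_t mark() { return pos; }       // remember the current position
    void seek(size_t m) { pos = m; }    // jump back to a remembered position
}

// Counts characters up to the next space without consuming them.
size_t distanceToSpace(ref ToyCursor c)
{
    auto m = c.mark();
    size_t n = 0;
    while (!c.empty && c.front != ' ')
    {
        c.popFront();
        ++n;
    }
    c.seek(m);    // backtrack: the caller sees an unchanged cursor
    return n;
}

void main()
{
    auto c = ToyCursor("identifier rest");
    assert(distanceToSpace(c) == 10);
    assert(c.front == 'i');   // position was restored
}
```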

An example of where fixed lookahead rocks:
https://github.com/blackwhale/datapicked/blob/master/dpick/buffer/buffer.d#L421

>
> 3. I think ideally the final design would use something like what
> std.allocator does with "unbounded" and "chooseAtRuntime" - some uses
> might not need lookahead or lookbehind or other features at all, so
> having a way to disable the relevant code would benefit those cases.

It makes sense to make lookahead and lookbehind optional.
As for the code - for the moment it doesn't add much and builds on stuff 
already there. Though I suspect some other implementations would be able 
to "cut corners" more efficiently.
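
One way to make the feature optional at compile time, sketched with a plain bool template parameter and static if. This is not how std.allocator spells its options, and the Buffer template below is purely hypothetical; it only shows that the lookahead code can disappear entirely from instantiations that do not ask for it.

```d
// Hypothetical buffer parameterized on whether lookahead is compiled in.
struct Buffer(bool hasLookahead)
{
    string data;
    size_t pos;

    char front() { return data[pos]; }
    void popFront() { ++pos; }

    static if (hasLookahead)
    {
        // Exactly n characters ahead, or null if not enough input remains.
        string lookahead(size_t n)
        {
            return pos + n <= data.length ? data[pos .. pos + n] : null;
        }
    }
}

void main()
{
    auto full = Buffer!true("abc");
    auto lean = Buffer!false("abc");
    assert(full.lookahead(2) == "ab");
    assert(lean.front == 'a');

    // The member does not exist at all in the lean instantiation.
    static assert(__traits(hasMember, Buffer!true, "lookahead"));
    static assert(!__traits(hasMember, Buffer!false, "lookahead"));
}
```

Generic code can then test for the member with __traits(hasMember, ...) or __traits(compiles, ...) and fall back to front/popFront when it is absent.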

-- 
Dmitry Olshansky

